---
title: "On the exactness of permutation tests"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{On the exactness of permutation tests}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(dplyr)
library(ggplot2)
library(purrr)
library(tidyr)
library(flipr)
load("../R/sysdata.rda")
```

In this article, we discuss the exactness property of permutation tests,
which is closely related to how $p$-values are computed from the
permutations. This article is a summary of the paper by @phipson2010.

## Traditional formulation of a statistical test

A statistical test aims at determining whether some observed data can be
considered to be strong evidence in favor of a so-called *alternative*
hypothesis $H_a$ compared to a so-called *null* hypothesis $H_0$. To do
that, a test statistic $T$ is defined such that:

-   its observed value under $H_0$ can be computed once you observed
    some data;
-   large values of the statistic are evidence in favor of $H_a$.
-   you know (an approximation of) the distribution of $T$ under the
    null hypothesis, also known as the **null distribution**.

Once such a test statistic is available and we observe some data, we can
denote by $t_\mathrm{obs}$ the value of the test statistic computed from
the observed data and define the so-called **p-value** as the null
hypothesis tail probability:
$$ p_\infty = \mathbb{P}_{H_0} \left( T \ge t_\mathrm{obs} \right). $$The
p-value $p_\infty$ is by definition uniformly distributed on $(0,1)$
under the null hypothesis. Hence, we can define the so-called
**significance level** $\alpha \in (0,1)$ and decide to reject $H_0$ in
favor of $H_1$ when $p_\infty \le \alpha$. By doing this, the
probability of wrongly rejecting $H_0$, also known as the probability of
type I errors, is simply:
$$ \mathbb{P}_{H_0} \left( p_\infty \le \alpha \right) = \alpha. $$The
significance level $\alpha$ therefore matches by design the probability
of type I errors, which means that choosing $\alpha$ allows to control
the probability of type I errors. We say that the test is **exact**.

When the null distribution of $T$ is not known, you do not have access
to $p_\infty$. A possible solution is then to resort to resampling
techniques to approach the null hypothesis. Once you get an approximate
null distribution, the question is how do you compute a p-value that
provides an exact statistical test. There are two approaches to this
problem: the first estimates the true p-value $p_\infty$ from the
approximate null distribution while the second proposes an alternative
definition of the p-value that can be straightforwardly computed. Let us
expand on both approaches.

## The two-sample problem in a permutational framework

We start with two samples
$X_1, \dots, X_{n_x} \stackrel{iid}{\sim} \mathcal{D}(\theta_x)$ and
$Y_1, \dots, Y_{n_y} \stackrel{iid}{\sim} \mathcal{D}(\theta_y)$. We
want to know whether the two distributions are the same or not on the
basis of the two samples we collected. In this parametric setting, it
boils down to testing the following hypotheses:
$$ H_0: \theta_x = \theta_y \quad \mbox{vs} \quad \theta_x \neq \theta_y. $$Let
$T$ be a statistic that depends on the two samples which is suited for
elucidating this test, i.e.:

-   you can compute its observed value under the null hypothesis once
    you observed some data;
-   large values of the statistic are evidence in favor of the
    alternative hypothesis.

Now, for performing the test, one should also know (an approximation of)
the distribution of $T$ under the null hypothesis, also known as the
**null distribution**, from which the p-value associated to this test
can be computed for instance.

If one knows the exact null distribution, then there is no need to
resort to permutations. However, if the null distribution is not known,
permutations come in handy for approaching it.

The idea is that, under the null hypothesis, the two samples come from
the same distribution. Hence, we can pull them together as one big
sample of size $n = n_x + n_y$ generated by that common distribution. At
this point, we can split the pooled sample into two random subsets of
size $n_x$ and $n_y$ respectively, and use them to compute a value of
the statistic $T$. If we repeat many times this splitting strategy, say
$m$ times, we end up with $m$ values of the statistic from which we can
compute the empirical distribution, known as the **permutation
distribution**, which approaches the null distribution.

## Permutation p-value as an unbiased estimator of $p_\infty$

Let $t_\mathrm{obs}$ be the value of the statistic computed from the
original two samples, $m$ be the number of permutations used to approach
the null distribution and $B$ be a random variable that counts the
number of test statistic values greater than or equal to
$t_\mathrm{obs}$.

By definition, the random variable $B$ follows a binomial distribution
of size $m$ and rate of success $p_\infty$. Hence, we can define the
following **unbiased estimator** of $p_\infty$:
$$ \widehat{p_\infty} = \frac{B}{m}. $$

However, when one uses this estimator of the p-value for the purpose of
hypothesis testing, the resulting test is not exact. Let us investigate
why.

First, remember that the true p-value $p_\infty$ is a random variable
itself, in the sense that its value changes as soon as $t_\mathrm{obs}$
changes i.e. each time the whole experiment is reconducted. Hence, the
probability of wrongly rejecting the null hypothesis using
$\widehat{p_\infty}$ reads:
$$ \mathbb{P} \left( \widehat{p_\infty} \le \alpha \right) = \int_\mathbb{R} \mathbb{P} \left( \widehat{p_\infty} \le \alpha | p \right) f_{p_\infty}(p) dp = \int_0^1 \mathbb{P} \left( \widehat{p_\infty} \le \alpha | p \right) dp, $$because
$p_\infty$ is uniformly distributed on $(0,1)$ under the null
hypothesis.

Next, notice that $\widehat{p_\infty}$ can only take on a finite set of
values
$\left\{ 0, \frac{1}{m}, \frac{2}{m}, \dots, \frac{m-1}{m}, 1 \right\}$.
Hence, we have for any $b \in 0, 1, \dots, m$:
$$ \mathbb{P} \left( \widehat{p_\infty} = \frac{b}{m} \right) = \int_0^1 \mathbb{P} \left( \widehat{p_\infty} = \left. \frac{b}{m} \right| p \right) dp = \int_0^1 \binom{m}{b} p^b (1-p)^{m-b} dp = \frac{1}{m + 1}. $$

We can therefore deduce that:
$$ \mathbb{P} \left( \widehat{p_\infty} \le \alpha \right) = \frac{\lfloor m \alpha \rfloor + 1}{m + 1} \neq \alpha. $$

The following R code shows graphically that using $\widehat{p_\infty}$
as p-value does not provide an exact test:

```{r fig.width=6, out.width="100%"}
alpha <- seq(0.01, 0.1, by = 0.01)
m <- c(10, 100, 1000)
p1 <- crossing(alpha, m) %>% 
  mutate(
    p = (floor(m * alpha) + 1) / (m + 1), 
    mf = paste("m =", m)
  ) %>% 
  ggplot(aes(alpha, p, color = mf)) + 
  geom_point() + 
  geom_abline(aes(intercept = 0, slope = 1)) + 
  labs(
    x = "Significance level",
    y = "Probability of wrongly rejecting H0"
  ) + 
  facet_wrap(vars(mf)) + 
  scale_color_viridis_d() + 
  scale_y_continuous(limits = c(0, 0.1)) + 
  coord_equal() + 
  theme_bw()
fig <- p1 %>% 
  plotly::ggplotly() %>% 
  plotly::hide_legend()
htmlwidgets::saveWidget(
  widget = fig, 
  file = "exactness-fig1.html", 
  selfcontained = rmarkdown::pandoc_available("1.12.3")
)
htmltools::tags$iframe(
  src = "exactness-fig1.html",
  scrolling = "no", 
  seamless = "seamless",
  frameBorder = "0",
  width = "100%", 
  height = 400
)
```

## Permutation p-value as the tail probability of a resampling distribution

An alternative strategy is to define the p-value by looking at the
random variable $B$ instead of the test statistic $T$. While the p-value
is the tail probability of the null distribution, in the context of
randomization tests, we can define it as the tail probability of the
distribution of $B$. Given a fixed number $m$ of sampled permutations,
recall that the random variable $B$ counts the number of test statistic
values larger than or equal to $t_\mathrm{obs}$. Hence, an alternative
equivalent definition of the p-value is given by the so-called **exact
permutation p-value**:
$$ p_e = \mathbb{P}_{H_0} \left( B \le b \right), $$where $b$ is the
observed number of test statistics larger than or equal to
$t_\mathrm{obs}$ (using the observed sample of permutations that was
drawn).

Let $B_t$ be a random variable that counts the total number of possible distinct
test statistic values exceeding $t_\mathrm{obs}$ and recall that $m_t$ is the
total number of possible distinct permutations. We denote by $$ p_t = \frac{B_t
+ 1}{m_t + 1}, $$the permutation p-value when the exhaustive list of all
permutations is used.

As we have seen before, it is straightforward to show that $B_t$ follows
a discrete uniform distribution on the integers $0, \dots, m_t$ and
that, conditional on $B_t = b_t$, the random variable $B$ follows a
binomial distribution of size $m$ and rate of success $p_t$. We can thus
write:
$$ p_e = \sum_{b_t=0}^{B_t} \mathbb{P}_{H_0} \left( B \le b | B_t = b_t \right) \mathbb{P}_{H_0} \left( B_t = b_t \right) = \frac{1}{m_t + 1} \sum_{b_t=0}^{B_t} F_B \left( b; m, \frac{b_t + 1}{m_t + 1} \right), $$
where $F_B \left( \cdot; m, \frac{b_t + 1}{m_t + 1} \right)$ is the cumulative
probability function of the binomial distribution of size $m$ and
probability of success $\frac{b_t + 1}{m_t + 1}$.

This can be computationally intense to compute for large values of $m_t$, in
which case one might use the following integral approximation:
$$ p_e \approx \frac{b+1}{m+1} - \int_0^{\frac{0.5}{m_t+1}} F_B (b; m, p_t) dp_t. $$

This approximation shows that the exact p-value $p_e$ is upper bounded by
$$
p_u = \frac{b+1}{m+1},
$$
which happens to be the exact p-value in the case of sampling permutations without replacement.

## Comparison by empirical evidence

In **flipr**, you perform a permutation test using the estimator
$\widehat{p_\infty}$ of the p-value $p_\infty$ by setting
`type == "estimate"`. This provides a non-exact test but an unbiased
estimate of $p_\infty$. You perform a permutation test using the
permutation p-value $p_e$ by setting `type == "exact"`. This provides an
exact test. You also have the possibility of using the upper bound p-value $p_u$ using `type == "upper_bound"`.

The following R code runs simulations to empirically estimate the
probability of wrongly rejecting the null hypothesis. The generative
model for both samples is the standard normal distribution. Sample sizes
are set to $n_1 = n_2 = 5$. We draw $20$ permutations for each test. We used a significance level of $5\%$.

```{r, eval=FALSE}
# General setup
nreps <- 1e4
n1 <- 5
n2 <- 5
set.seed(12345)
sim <- map(sample.int(.Machine$integer.max, nreps, replace = TRUE), ~ {
    list(
      x = rnorm(n = n1, mean = 0, sd = 1),
      y = rnorm(n = n2, mean = 0, sd = 1),
      s = .x
    )
  })

# Cluster setup
cl <- makeCluster(detectCores(logical = FALSE))
clusterEvalQ(cl, {
  library(tidyverse)
  library(flipr)
  null_spec <- function(y, parameters) {
    map(y, ~ .x - parameters)
  }
  stat_functions <- list(stat_t)
  stat_assignments <- list(delta = 1)
  nperms <- 20
  alpha <- 0.05
})

alpha_estimates <- pbapply::pblapply(sim, function(.l) {
    pf <- PlausibilityFunction$new(
      null_spec = null_spec,
      stat_functions = stat_functions,
      stat_assignments = stat_assignments,
      .l$x, .l$y,
      seed = .l$s
    )
    pf$set_nperms(nperms)
    pf$set_alternative("left_tail")
    pf$set_pvalue_formula("exact")
    pv_exact <- pf$get_value(0)
    pf$set_pvalue_formula("upper_bound")
    pv_upper_bound <- pf$get_value(0)
    pf$set_pvalue_formula("estimate")
    pv_estimate <- pf$get_value(0)
    c(
      exact       = pv_exact       <= alpha,
      upper_bound = pv_upper_bound <= alpha,
      estimate    = pv_estimate    <= alpha
    )
  }, cl = cl) %>%
  transpose() %>%
  simplify_all() %>%
  map(mean)
stopCluster(cl)
```

```{r}
as_tibble(alpha_estimates)
```

This simulation provides numerical evidence that using the permutation p-value
$p_e$ yields an exact test. This is by design since $p_e$ is a genuine
probability hence it is uniformly distributed on $(0,1)$ under the null
hypothesis. We also observe that the upper bound $p_u$ is very close to the exact p-value $p_e$.

Of course, one might criticize this simulation in which the number of permutations $m$ has been chosen small on purpose to prove our point. In effect, when $m$ is larger, all formulae to compute the p-value yield an asymptotically exact test. Quoting @phipson2010, *while this is true, the urgent need to avoid small $m$ disappears when the exact p-value $p_e$ is used in place of $\widehat{p_\infty}$, because $p_e$ ensures a valid statistical test regardless of the sample sizes or the number of permutations. When exact p-values are used, the only penalty for choosing $m$ small is a loss of statistical power to reject the null hypothesis. This is as it should be: more permutations should generally provide greater power*.

## References
