---
title: "Comparing Distributions with the distfreereg Package"
output:
  rmarkdown::html_vignette:
    number_sections: true 
vignette: >
  %\VignetteIndexEntry{Comparing Distributions with the distfreereg Package}
  %\VignetteEngine{knitr::rmarkdown_notangle}
  %\VignetteEncoding{UTF-8}
header-includes:
  - \def\eqdef{\mathrel{\raise0.4pt\hbox{$:$}\hskip-2pt=}}
  - \def\ev{{\rm E}}
  - \def\given{\mathrel|}
  - \def\proglang#1{\textsf{#1}}
  - \def\class#1{\textsf{#1}}
  - \def\pkg#1{\textbf{#1}}
linkcolor: blue
link-citations: true
csl: american-statistical-association.csl
bibliography: bibliography.bib
---


\def\Secref#1{\hyperref[sec:#1]{Section~\ref{sec:#1}}}
\def\eqdef{\mathrel{\raise0.4pt\hbox{$:$}\hskip-2pt=}}

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, out.width = "49%", fig.dim = c(6,5),
                      fig.align = "center")
is_check <- ("CheckExEnv" %in% search()) ||
  any(c("_R_CHECK_TIMINGS_", "_R_CHECK_LICENSE_") %in% names(Sys.getenv()))
knitr::opts_chunk$set(eval = !is_check)

library(distfreereg)
```







# Comparing Empirical and Theoretical Distributions of Statistics
\label{sec:compare}

Because the procedure implemented by `distfreereg()` is distribution-free only
asymptotically, it is important to explore how well the approximations perform
at finite sample sizes.  How large the sample size must be for the asymptotic
behavior to be sufficiently achieved is dependent on the details of each
application. To facilitate the exploration of sample size requirements,
**distfreereg** provides the function `compare()`.

In the following example, we explore the required sample size when the true mean
is $$f(X;\theta) = \theta_1 + \theta_2X_1 + \theta_3X_2,$$ $\theta=(2,5,-1)$,
and the errors are independent standard normal errors. The covariate matrix is
generated randomly. We start with a sample size of 10.

```{r comp_1, message = FALSE}
set.seed(20240913)
n <- 10
func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2]
theta <- c(2, 5, -1)
X <- matrix(rexp(2*n, rate = 1), nrow = n)
comp_dfr <- compare(theta = theta, true_mean = func, test_mean = func,
                    true_X = X, true_covariance = list(Sigma = 3), X = X,
                    covariance = list(Sigma = 3),
                    theta_init = rep(1, length(theta)))
```

This function generates a response vector `Y` from `true_mean` using the given
values of `true_X`, `true_covariance` (by default, the errors are multivariate
normal), and `theta`. It then passes `Y` and most of the remaining arguments to
`distfreereg()`. The statistics and p-values corresponding to each simulated
data set are saved, and the process is repeated. The results are returned in an
object of class `compare`.




# Plotting the Results

These results can be explored graphically. Several options are available. The
default behavior of the `compare` method for `plot()` produces a
comparison of estimated cumulative distribution functions (CDFs) for the
observed and simulated statistics. If the sample size is sufficiently large,
these two curves should be nearly identical.

```{r cdf}
plot(comp_dfr)
```

This plot gives some cause for concern.

Another way of exploring this uses quantile--quantile (Q--Q) plots. Below is a
Q--Q plot that compares observed and simulated statistics. If the sample size is
sufficiently large, these points should lie on the line $y=x$.

```{r qq}
plot(comp_dfr, which = "qq")
```

The curvature indicates an insufficient sample size. Similar to this is the next
plot, which is a Q--Q plot comparing p-values to uniform quantiles. Once again,
if the sample size is sufficiently large, these points should lie on the line
$y=x$.

```{r qqp}
plot(comp_dfr, which = "qqp")
```

These plots all indicate that this sample size is insufficient for the
asymptotic behavior for the KS statistic. Similar plots can be made for the CvM
statistic. More details on this plot method are discussed [in the plotting
vignette](../doc/v3_plotting.html).


# Formal Comparison Testing

The `compare` method for `ks.test()`, which compares the observed and
simulated distributions, confirms that they are still different:

```{r ks.test}
ks.test(comp_dfr)
```

The default statistic is whatever is the first statistic appearing in the
supplied object:

```{r}
names(comp_dfr$obs)
ks.test(comp_dfr, stat = "CvM")
```




# Increasing the Sample Size

Let us repeat the simulation with a sample size of 100.
```{r comp_larger_n, message = FALSE}
n_2 <- 100
X_2 <- matrix(rexp(2*n_2, rate = 1), nrow = n_2)
comp_dfr_2 <- update(comp_dfr, true_X = X_2, X = X_2)
```

The following plots indicate that this sample size is sufficient.

```{r qqplots_2}
plot(comp_dfr_2)
plot(comp_dfr_2, which = "qq")
plot(comp_dfr_2, which = "qqp")
```

A formal test confirms this assessment.

```{r ks.test_2}
ks.test(comp_dfr_2)
```



# Using a `distfreereg` Object: `asymptotics()`

When a `distfreereg` object has already been created, the function
`asymptotics()` provides a convenient wrapper for `compare()` to explore sample
size adequacy. See the function's documentation for details.



# Power Considerations

The examples using `compare()` above all illustrate behavior of the function
when the null hypothesis is true. It is also important to investigate the
behavior when it is false to learn about the test's power against particular
alternative hypotheses. Using the example in \Secref{compare}, consider the case
in which the true mean function is quadratic in $X_2$: $f(x;\theta) = \theta_0 +
\theta_1x_1 + \theta_2x_2 + x_2^2$. Having verified above that a sample size of
100 is sufficient for the asymptotic behavior to be sufficiently approximated,
this can be investigated as follows.

```{r power, message = FALSE}
alt_func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2] + 0.5*X[,2]^2
comp_dfr_3 <- update(comp_dfr_2, true_mean = alt_func, true_covariance = list(Sigma = 3))
```

```{r}
plot(comp_dfr_3)
```

As hoped, these curves are clearly different. Power estimates can be obtained
with `rejection()`.

```{r}
rejection(comp_dfr_3)
```

Power for different significance levels can be found by changing the `alpha`
argument.

```{r}
rejection(comp_dfr_3, alpha = 0.01)
```


# Warnings

The test implemented by `distfreereg()` is distribution-free, but simulating
values requires specifying an error distribution. The default behavior of
`compare()` is to use multivariate normal errors, or the default behavior of
`simulate()`. There is no way to avoid specifying a distribution to run these
simulations. Alternative error distributions can be specified, however, when
`true_mean` is a `function`, an `nls` object, or a `formula` with `method` equal
to "`nls`".

The safe interpretation of these simulations is limited to the error
distribution that is used.