---
title: "`ssp.relogit`: Subsampling for Logistic Regression Model with Rare Events"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{`ssp.relogit`: Subsampling for Logistic Regression Model with Rare Events}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(subsampling)
```

This vignette introduces the usage of `ssp.relogit`. The statistical theory and
algorithms in this implementation can be found in relevant reference papers.

The logistic regression log-likelihood function is

$$
\max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left[y_i \beta^{\top} x_i -
\log\left(1 + e^{\beta^\top x_i}\right) \right].
$$

## Terminology

- Full dataset: The whole dataset used as input.

- Full data estimator: The estimator obtained by fitting the model on the full
  dataset.

- Subsample: A subset of observations drawn from the full dataset.

- Subsample estimator: The estimator obtained by fitting the model on the
  subsample.

- Subsampling probability ($\pi$): The probability assigned to each observation
  for inclusion in the subsample.

- Rare events: Observations where $Y=1$ (positive instances).

- Non-rare events: Observations where $Y=0$ (negative instances).

The idea of subsampling methods is as follows: instead of fitting the model on
the size $N$ full dataset, a subsampling probability is assigned to each
observation and a smaller, informative subsample is drawn. The model is then
fitted on the subsample to obtain an estimator with reduced computational cost.

In the face of logistic regression with rare events, @wang2021nonuniform shows
that the available information ties to the number of positive instances instead
of the full data size. Based on this insight, one can keep all the rare
instances and perform subsampling on the non-rare instances to reduce the
computational cost.

## Example

We introduce the basic usage by using `ssp.relogit` with simulated data. $X$
contains $d=6$ covariates drawn from multinormal distribution and $Y$ is the
binary response variable. The full data size is $N = 2 \times 10^4$. Denote
$N_{1}=sum(Y)$ as the counts of rare observations and $N_{0} = N - N_{1}$ as the
counts of non-rare observations.

```{r}
set.seed(2)
N <- 2 * 1e4
beta0 <- c(-6, -rep(0.5, 6))
d <- length(beta0) - 1
X <- matrix(0, N, d)
corr <- 0.5
sigmax <- corr ^ abs(outer(1:d, 1:d, "-"))
X <- MASS::mvrnorm(n = N, mu = rep(0, d), Sigma = sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
print(paste('N: ', N))
print(paste('sum(Y): ', sum(Y)))
n.plt <- 200
n.ssp <- 1000
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
```

## Key Arguments

The function usage is

```{r, eval = FALSE}
ssp.relogit(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  criterion = "optL",
  likelihood = "logOddsCorrection",
  control = list(...),
  contrasts = NULL,
  ...
)
```

The core functionality of `ssp.glm` revolves around three key questions:

- How are subsampling probabilities computed? (Controlled by the `criterion`
  argument)

- How is the subsample drawn?

- How is the likelihood adjusted to correct for bias? (Controlled by the
  `likelihood` argument)

Different from `ssp.glm` which can choose `withReplacement` and `poisson` as the
option of `sampling.method`, `ssp.relogit` uses `poisson` as default sampling
method. `poisson` stands for drawing subsamples one by one by comparing the
subsampling probability with a realization of uniform random variable
$U(0,1)$. The actual size of drawn subsample is random but the expected size is
$n.ssp$.

### `criterion`

The choices of `criterion` include `optA`, `optL`(default), `LCC` and
`uniform`. The optimal subsampling criterion `optA` and `optL` are derived by
minimizing the asymptotic covariance of subsample estimator, proposed by
@wang2018optimal. `LCC` and `uniform` are baseline methods.

Note that for rare observations $Y=1$ in the full data, the sampling
probabilities are $1$. For non-rare observations, the sampling probabilities
depend on the choice of `criterion`.

### `likelihood`

The available choices for `likelihood` include `weighted` and
`logOddsCorrection`(default). Both of these likelihood functions can derive an
unbiased estimator. Theoretical results indicate that `logOddsCorrection` is
more efficient than `weighted` in the context of rare events logistic
regression. See @@wang2021nonuniform.

## Results

After drawing subsample, `ssp.relogit` utilizes `survey::svyglm` to fit the
model on the subsample, which eventually uses `glm`. Arguments accepted by
`svyglm` can be passed through `...` in `ssp.glm`.

Below is an example demonstrating the use of `ssp.relogit`.

```{r}
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.relogit(formula = formula,
                           data = data,
                           n.plt = n.plt,
                           n.ssp = n.ssp,
                           criterion = 'optA',
                           likelihood = 'logOddsCorrection'
                           )
```

## Outputs

The returned object contains estimation results and indices of drawn subsample
in the full dataset.

```{r}
names(ssp.results)
```

Some key returned variables:

- `index.plt` and `index` are the row indices of drawn pilot subsamples and
optimal subsamples in the full data.

- `coef.ssp` is the subsample estimator for $\beta$ and `coef` is the linear
combination of `coef.plt` (pilot estimator) and `coef.ssp`.

- `cov.ssp` and `cov` are estimated covariance matrices of `coef.ssp` and
`coef`. 

```{r}
summary(ssp.results)
```

In the printed results, `Expected Subsample Size` is the sum of rare event
counts ($N_{1}$) and the expected size of negative subsample drawn from $N_{0}$
non-rare observations. 
`Actual Subsample Size` is the sum of $N_{1}$ and the actual size of negative
subsample from $N_{0}$ non-rare observations.

The coefficients and standard errors printed by `summary()` are `coef` and the
square root of `diag(cov)`.

## References