---
title: "An introduction to M-estimation with `geex`"
author: "B. Saul"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to geex}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
references:
- id: stefanski2002
  author:
  - family: Stefanski
    given: Len A.  
  - family: Boos
    given: Dennis D.
  container-title: The calculus of M-estimation
  type: article-journal
  URL: 'http://www.jstor.org/stable/3087324'
  DOI: 10.1198/000313002753631330
  journal: The American Statistician
  issued:
    year: 2002 
  volume: 56
  number: 1
  pages: 29-38
---

```{r, echo = FALSE, message = FALSE, warning=FALSE}
library(geex)
library(knitr)
opts_knit$set(progress = TRUE, verbose = TRUE)
```

## From M-estimation math to code

In the basic set-up, M-estimation applies to estimators of the $p \times 1$ parameter $\theta=(\theta_1, \theta_2, \dots, \theta_p)^{\intercal}$ which can be obtained as solutions to an equation of the form

\begin{equation}
\label{eq:psi}
\sum_{i = 1}^m \psi(O_i, \theta) = 0,
\end{equation}

\noindent where $O_1, \ldots, O_m$ are $m$ independent and identically distributed (iid) random variables, and the function $\psi$ returns a vector of length $p$ and does not depend on $i$ or $m$ [@stefanski2002, hereafter SB]. See SB for the case where the $O_i$ are independent but not necessarily identically distributed. The roots of (1) are referred to as M-estimators and denoted by $\hat \theta$. M-estimators can be solved for analytically in some cases or computed numerically in general. Under certain regularity conditions, the asymptotic properties of $\hat{\theta}$ can be derived from Taylor series approximations, the law of large numbers, and the central limit theorem. In brief, let $\theta_0$ be the true parameter value defined by $\int \psi(o, \theta_0) dF(o) = 0$, where $F$ is the distribution function of $O$. Let $\dot{\psi}(o, \theta) = {\partial \psi(O_i, \theta)/\partial \theta}^{\intercal}$, $A(\theta_0) = E[-\dot{\psi}(O_1, \theta_0)]$, and $B(\theta_0) = E[\psi(O_1, \theta_0)\psi(O_1, \theta_0)^{\intercal}]$. Then under suitable regularity assumptions, $\hat{\theta}$ is consistent and asymptotically Normal, i.e., 

\begin{equation*}
\label{eq:variance}
\sqrt{m}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, V(\theta_0))  \text{ as } m \to \infty,
\end{equation*}

\noindent where $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0)  \{A(\theta_0)^{-1} \}^{\intercal}$. The sandwich form of $V(\theta_0)$ suggests several possible large sample variance estimators. For some problems, the analytic form of $V(\theta_0)$ can be derived and estimators of $\theta_0$ and other unknowns simply plugged into $V(\theta_0)$. 
Alternatively, $V(\theta_0)$ can be consistently estimated by the empirical sandwich variance estimator, where the expectations in $A(\theta)$ and $B(\theta)$ are replaced with their empirical counterparts. Let $A_i = - \dot{\psi}(O_i, \theta)|_{\theta = \hat{\theta}}$, $A_m = m^{-1} \sum_{i = 1}^m A_i$, $B_i = \psi(O_i, \hat{\theta}) \psi(O_i, \hat{\theta})^{\intercal}$, and $B_m = m^{-1} \sum_{i = 1}^m B_i$. The empirical sandwich estimator of the variance of $\hat{\theta}$ is:

\begin{equation}
\label{eq:esve}
\hat{\Sigma} = A_m^{-1} B_m \{A_m^{-1}\}^{\intercal}/m .
\end{equation}

The `geex` package provides an application programming interface (API) for carrying out M-estimation. The analyst provides a function, called `estFUN`, corresponding to $\psi(O_i, \theta)$ that maps data $O_i$ to a function of $\theta$. Numerical derivatives approximate $\dot{\psi}$ so that evaluating $\hat{\Sigma}$ is entirely a computational exercise. No analytic derivations are required from the analyst. 

Consider estimating the population mean $\theta = E[Y_i]$ using the sample mean $\hat{\theta} = m^{-1} \sum_{i=1}^m Y_i$ of $m$ iid random variables $Y_1, \dots, Y_m$. The estimator $\hat{\theta}$  can be expressed as the solution to the following estimating equation:

\[
\sum_{i = 1}^m (Y_i - \theta) = 0.
\] 

\noindent which is equivalent to solving \eqref{eq:psi} where $O_i = Y_i$ and $\psi(O_i, \theta) = Y_i - \theta$. An `estFUN` is a translation of $\psi$ into a function whose first argument is `data` and returns a function whose first argument is `theta`.  An `estFUN` corresponding to the estimating equation for the sample mean of $Y$ is:

```
meanFUN <- function(data){ function(theta){ data$Y - theta } } .
``` 

If an estimator fits into the above framework, then the user need only specify `estFUN`. No other programming is required to obtain point and variance estimates. The remaining sections provide examples of translating $\psi$ into an `estFUN`.

## Calculus of M-estimation Examples

The three examples of M-estimation from SB are presented here for demonstration. For these examples, the data are $O_i = \{Y_{1i}, Y_{2i}\}$ where $Y_1 \sim N(5, 16)$ and $Y_2 \sim N(2, 1)$ for $m = 100$ and are included in the `geexex` dataset. 

### Example 1: Sample moments

The first example estimates the population mean ($\theta_1$) and variance ($\theta_2$) of $Y_1$. The solution to the estimating equations below are the sample mean $\hat{\theta}_1 = m^{-1} \sum_{i = 1}^m Y_{1i}$ and sample variance $\hat{\theta}_2 = m^{-1} \sum_{i = 1}^m (Y_{1i} - \hat{\theta}_1)^2$.


\[
\psi(Y_{1i}, \theta) = 
 \begin{pmatrix}
  Y_{1i} - \theta_1 \\
  (Y_{1i} - \theta_1)^2 - \theta_2
 \end{pmatrix}
\]

```{r SB1_estfun, echo=TRUE, results='hide'}
SB1_estfun <- function(data){
  Y1 <- data$Y1
  function(theta){
      c(Y1 - theta[1],
       (Y1 - theta[1])^2 - theta[2])
  }
}
```

The primary `geex` function is `m_estimate`, which requires two inputs: `estFUN` (the $\psi$ function), `data` (the data frame containing $O_i$ for $i = 1, \dots, m$). The package defaults to `rootSolve::multiroot` or estimating roots of \eqref{eq:psi}, though the solver algorithm can be specified in the `root_control` argument. Starting values for `rootSolve::multiroot` are passed via the `root_control` argument; e.g., 

```{r SB1_run, echo=TRUE, eval=TRUE, message=FALSE}
library(geex)
results <- m_estimate(
    estFUN = SB1_estfun, 
    data   = geexex,
    root_control = setup_root_control(start = c(1,1)))
```

```{r SB1_clsform, echo=TRUE, eval = TRUE}
n <- nrow(geexex)
A <- diag(1, nrow = 2)

B <- with(geexex, {
  Ybar <- mean(Y1)
  B11 <- mean( (Y1 - Ybar)^2 )
  B12 <- mean( (Y1 - Ybar) * ((Y1 - Ybar)^2 - B11) )
  B22 <- mean( ((Y1 - Ybar)^2 - B11)^2 )
  matrix(
    c(B11, B12,
      B12, B22), nrow = 2
  )
})

# closed form roots
theta_cls <- c(mean(geexex$Y1),
               # since var() divides by n - 1, not n
               var(geexex$Y1) * (n - 1)/ n ) 

# closed form sigma
Sigma_cls <- (solve(A) %*% B %*% t(solve(A))) / n

comparison <- list(geex = list(estimates = coef(results), vcov = vcov(results)), 
                   cls  = list(estimates = theta_cls, vcov = Sigma_cls))
```

The `m_estimate` function returns an object of the `S4` class `geex`, which contains an `estimates` slot and `vcov` slot for $\hat{\theta}$ and $\hat{\Sigma}$, respectively. These slots can be accessed by the functions `coef` (or `roots`) and `vcov`. The point estimates obtained for $\theta_1$ and $\theta_2$ are analogous to the base R functions `mean` and `var` (after multiplying by $m-1/m$ for the latter). SB gave a closed form for $A(\theta_0)$ (an identity matrix) and $B(\theta_0)$ (not shown here) and suggest plugging in sample moments to compute $B(\hat{\theta})$. The point and variance estimates using both `geex` and the analytic solutions are shown below}. The maximum absolute difference between either the point or variance estimates is `r format(max(abs(comparison$geex$estimates - comparison$cls$estimates), abs(comparison$geex$vcov - comparison$cls$vcov)), digits = 1)`, thus demonstrating excellent agreement between the numerical results obtained from `geex` and the closed form solutions for this set of estimating equations and data. 

```{r SB1_results, echo = FALSE}
comparison
```

### Example 2: Ratio estimator

This example calculates a ratio estimator and illustrates the delta method via M-estimation. The estimating equations target the means of $Y_1$ and $Y_2$, labelled $\theta_1$ and $\theta_2$, as well as the estimand $\theta_3 = \theta_1/ \theta_2$.  

\[
\psi(Y_{1i}, Y_{2i}, \theta) = 
\begin{pmatrix}
Y_{1i} - \theta_1 \\
Y_{2i} - \theta_2 \\
\theta_1 - \theta_3\theta_2
\end{pmatrix}
\]

\noindent The solution to \eqref{eq:psi} for this $\psi$ function yields $\hat{\theta}_3 = \bar{Y}_1 / \bar{Y}_2$, where $\bar{Y}_j$ denotes the sample mean of $Y_{j1}, \ldots,Y_{jm}$ for $j=1,2$.

```{r SB2_eefun, echo = TRUE}
SB2_estfun <- function(data){
  Y1 <- data$Y1; Y2 <- data$Y2
  function(theta){
      c(Y1 - theta[1],
        Y2 - theta[2],
        theta[1] - (theta[3] * theta[2]) 
    )
  }
}
```

```{r SB2_run, echo = TRUE, message = FALSE}
results <- m_estimate(
    estFUN = SB2_estfun, 
    data  = geexex, 
    root_control = setup_root_control(start = c(1, 1, 1)))
```

```{r SB2_clsform, echo = TRUE}
# Comparison to an analytically derived sanwich estimator
A <- with(geexex, {
 matrix(
  c(1 , 0, 0,
    0 , 1, 0,
    -1, mean(Y1)/mean(Y2), mean(Y2)),
  byrow = TRUE, nrow = 3)
})

B <- with(geexex, {
  matrix(
    c(var(Y1)   , cov(Y1, Y2), 0,
      cov(Y1, Y2), var(Y2)   , 0,
      0, 0, 0),
    byrow = TRUE, nrow = 3)
})

## closed form roots
theta_cls <- c(mean(geexex$Y1), mean(geexex$Y2))
theta_cls[3] <- theta_cls[1]/theta_cls[2]
## closed form covariance
Sigma_cls <- (solve(A) %*% B %*% t(solve(A))) / nrow(geexex)
comparison <- list(geex = list(estimates = coef(results), vcov = vcov(results)), 
                   cls  = list(estimates = theta_cls, vcov = Sigma_cls))
```

SB gave closed form expressions for $A(\theta_0)$ and $B(\theta_0)$, into which we plug in appropriate estimates for the matrix components and compare to the results from `geex`. The point estimates again show excellent agreement (maximum absolute difference `r format(max(abs(comparison$geex$estimates - comparison$cls$estimates)), digits = 2)`), while the covariance estimates differ by the third decimal (maximum absolute difference `r format(max(abs(comparison$geex$vcov - comparison$cls$vcov)), scientific = TRUE, digits = 2)`). 

```{r SB2_results, echo = TRUE}
comparison
```

### Example 3: Delta method continued

This example extends Example 1 to again illustrate the delta method. The estimating equations target not only the mean ($\theta_1$) and variance ($\theta_2$) of $Y_1$, but also the standard deviation ($\theta_3$) and the log of the variance ($\theta_4$) of $Y_1$.

\[
\psi(Y_{1i}, \mathbf{\theta}) = 
\begin{pmatrix}
Y_{1i} - \theta_1 \\
(Y_{1i} - \theta_1)^2 - \theta_2 \\
\sqrt{\theta_2} - \theta_3 \\
\log(\theta_2) - \theta_4
\end{pmatrix}
\]

```{r SB3_eefun, echo = TRUE, warning = FALSE, message=FALSE}
SB3_estfun <- function(data){
  Y1 <- data$Y1
  function(theta){
      c(Y1 - theta[1],
       (Y1 - theta[1])^2 - theta[2],
       sqrt(theta[2]) - theta[3],
       log(theta[2]) - theta[4])
  }
}
```

```{r SB3_run, echo = FALSE, message=FALSE}
results <- m_estimate(
   estFUN= SB3_estfun, 
   data  = geexex,
   root_control = setup_root_control(start = rep(2, 4, 4, 4)))
```

```{r SB3_clsform, echo = TRUE}
## closed form roots
theta_cls <- numeric(4)
theta_cls[1] <- mean(geexex$Y1)
theta_cls[2] <- sum((geexex$Y1 - theta_cls[1])^2)/nrow(geexex)
theta_cls[3] <- sqrt(theta_cls[2])
theta_cls[4] <- log(theta_cls[2])

## Compare to closed form ##
theta2 <- theta_cls[2]
mu3 <- moments::moment(geexex$Y1, order = 3, central = TRUE)
mu4 <- moments::moment(geexex$Y1, order = 4, central = TRUE)
## closed form covariance
Sigma_cls <- matrix(
  c(theta2, mu3, mu3/(2*sqrt(theta2)), mu3/theta2,
    mu3, mu4 - theta2^2, (mu4 - theta2^2)/(2*sqrt(theta2)), (mu4 - theta2^2)/theta2,
    mu3/(2 * sqrt(theta2)), (mu4 - theta2^2)/(2*sqrt(theta2)), (mu4 - theta2^2)/(4*theta2), (mu4 - theta2^2)/(2*theta2^(3/2)),
    mu3/theta2, (mu4 - theta2^2)/theta2, (mu4 - theta2^2)/(2*theta2^(3/2)), (mu4/theta2^2) - 1) ,
  nrow = 4, byrow = TRUE) / nrow(geexex)
## closed form covariance
# Sigma_cls <- (solve(A) %*% B %*% t(solve(A))) / n
comparison <- list(geex = list(estimates = coef(results), vcov = vcov(results)), 
                   cls  = list(estimates = theta_cls, vcov = Sigma_cls))
```

SB again provided a closed form for $A(\theta_0)$ and $B(\theta_0)$, which we compare to the `geex` results. The maximum absolute difference between `geex` and the closed form estimates for both the parameters and the covariance is `r format(max(abs(comparison$geex$estimates - comparison$cls$estimates), abs(comparison$geex$vcov - comparison$cls$vcov)), digits = 2)`. 

```{r SB3_results, echo = FALSE}
comparison
```

# References
