---
title: Introduction to the empirical Bayes normal means model via shrinkage estimation
output: 
  rmarkdown::html_vignette:
    toc: yes
bibliography: ebnm.bib
vignette: >
  %\VignetteIndexEntry{Introduction to the empirical Bayes normal means model via shrinkage estimation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#", collapse = TRUE, results = "hold",
                      fig.align = "center", dpi = 90)
```

## The normal means model and empirical Bayes

Given $n$ observations $x_i$ with known standard deviations $s_i > 0$, $i
= 1, \dots, n$, the normal means model
[@Robbins51; @efron1972limiting; @Stephens_NewDeal; @bhadra2019lasso;
@Johnstone; @lei-thesis] has
\begin{equation}
x_i \overset{\text{ind.}}{\sim} \mathcal{N}(\theta_i, s_i^2),
\end{equation}
where the unknown ("true") means $\theta_i$ are the quantities
to be estimated. Here and throughout, we use
$\mathcal{N}(\mu, \sigma^2)$ to denote the normal distribution with
mean $\mu$ and variance $\sigma^2$.

The empirical Bayes (EB) approach to inferring $\theta_i$ attempts to
improve upon the maximum-likelihood estimate $\hat{\theta}_i = x_i$ by "borrowing
information" across observations, exploiting the fact that each
observation contains information not only about its respective mean,
but also about how the means are collectively distributed
[@Robbins56; @Morris; @Efron_Book; @Stephens_NewDeal].  Specifically,
the empirical Bayes normal means (EBNM) approach assumes that
\begin{equation}
\theta_i \overset{\text{ind.}}{\sim} g \in \mathcal{G},
\end{equation}
where $\mathcal{G}$ is some family of distributions that is
specified in advance and $g \in \mathcal{G}$ is estimated using the data. 

The EBNM model is fit by first using
all of the observations to estimate the prior $g \in \mathcal{G}$, and then using
the estimated distribution $\hat{g}$ to compute posteriors and/or posterior summaries 
for the "true" means 
$\theta_i$. Commonly, $g$ is estimated via maximum-likelihood and posterior means
are used as point estimates for the unknown means. The **ebnm** 
package provides a unified interface for efficiently carrying out both steps,
with a wide range of available 
options for the prior family $\mathcal{G}$.

For a detailed introduction, see our [JSS paper][ebnm-paper]. For further
background, see for example [John Storey's book][storey-book]. 

## An illustration: shrinkage estimation

Our example data set consists of 400 data points simulated from a normal means
model in which the true prior $g$ is a mixture of (a) a normal
distribution centered at 2 and (b) a point-mass also centered
at 2:

$$
\theta_i \sim 0.8\delta_2 + 0.2 N(2,1)
$$

First, we simulate the "true" means $\theta_i$ from this prior:

```{r sim-data}
set.seed(1)
n <- 400
u <- 2 + (runif(n) < 0.2) * rnorm(n)
```

Next, we simulate the observed means $x_i$ as "noisy" estimates of the
true means (in this example, the noise is homoskedastic):

$$
x_i \sim N(\theta_i,s_i), \quad s_i = 1/3,
$$

```{r sim-data-2}
s <- rep(1/3, n)
x <- u + s * rnorm(n)
```

Although we know what the true means are in this example, 
we'll treat them as quantities we cannot observe.

The maximum-likelihood estimates (MLEs) of the true means are simply
$\hat{u}_i = x_i$:

```{r plot-mle, fig.height=3.5, fig.width=3.5}
par(mar = c(4, 4, 2, 2))
lims <- c(-0.55, 5.05)
plot(u, x, pch = 4, cex = 0.75, xlim = lims, ylim = lims,
     xlab = "true value", ylab = "estimate", main = "MLE")
abline(a = 0, b = 1, col = "magenta", lty = "dotted")
```

We can do better than the MLE --- and in
fact some theory tells us we are guaranteed to do better --- by learning a
prior using all the observations, then "shrinking" the estimates toward
this prior.

Let's illustrate this idea with a simple normal prior in which the
mean and variance of the normal prior are learned from the
data. (Note that the normal prior is the wrong prior for this
data set! Recall we that simulated data using a mixture of a normal and a
point-mass.)

First, we fit the prior:

```{r ebnm-normal}
library("ebnm")
fit_normal <- ebnm(x, s, prior_family = "normal", mode = "estimate")
```

Next we estimate the true means using posterior means
$\hat{u}_i = E[\theta_i \,|\, x_i, \hat{g}]$. We extract these posterior means 
using the `coef()` method:

```{r plot-ebnm-normal, fig.height=3.5, fig.width=3.5}
y <- coef(fit_normal)
par(mar = c(4, 4, 2, 2))
plot(u, y, pch = 4, cex = 0.75, xlim = lims, ylim = lims,
     xlab = "true value", ylab = "estimate", main = "normal prior")
abline(a = 0, b = 1, col = "magenta", lty = "dotted")
```

These "shrunken" estimates are better when true means $\theta_i$ are near 2, but
worse when they are far from 2. Still, they substantially improve
the *overall estimation error* (the "root mean-squared error" or RMSE):

```{r mse-1}
err_mle           <- (x - u)^2
err_shrink_normal <- (y - u)^2
print(round(digits = 4,
            x = c(mle           = sqrt(mean(err_mle)),
                  shrink_normal = sqrt(mean(err_shrink_normal)))))
```

Here's a more detailed comparison of the estimation error:

```{r plot-mse-1, fig.height=3.5, fig.width=3.5}
par(mar = c(4, 4, 2, 2))
plot(err_mle, err_shrink_normal, pch = 4, cex = 0.75,
     xlim = c(0, 1.2), ylim = c(0, 1.2))
abline(a = 0, b = 1, col = "magenta", lty = "dotted")
```

Indeed, the error increases in a few of the estimates and decreases in
many of the other estimates, resulting in a lower RMSE over the 400 data points.

Let's now see what happens when we use a family of priors that is
better suited to this data set --- specifically, the "point-normal"
family. Notice that the only change we make in our call to
`ebnm()` is in the `prior_family` argument:

```{r ebnm-pn}
fit_pn <- ebnm(x, s, prior_family = "point_normal", mode = "estimate")
```

Now we extract the posterior mean estimates and compare to the true
values:

```{r plot-ebnm-pn, fig.height=3.5, fig.width=3.5}
par(mar = c(4, 4, 2, 2))
y <- coef(fit_pn)
plot(u, y, pch = 4, cex = 0.75, xlim = lims, ylim = lims,
     xlab = "true value", ylab = "estimate", main = "point-normal prior")
abline(a = 0, b = 1, col = "magenta", lty = "dotted")
```

The added flexibility of the point-normal prior improves the accuracy of estimates for
means near 2, while estimates for means far from 2 are no worse than the 
MLEs. The result is that the overall RMSE again sees a substantial improvement:

```{r mse-2}
err_shrink_pn <- (y - u)^2
print(round(digits = 4,
            x = c(mle = sqrt(mean(err_mle)),
                  normal = sqrt(mean(err_shrink_normal)),
                  point_normal = sqrt(mean(err_shrink_pn)))))
```

## Session information

The following R version and packages were used to generate this vignette:

```{r}
sessionInfo()
```

## References

[ebnm-paper]: https://doi.org/10.18637/jss.v114.i03
[storey-book]: https://jdstorey.org/fas/