---
title: "Enriching `glm` objects"
author: "[Ioannis Kosmidis](https://www.ikosmidis.com)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: enrichwith.bib
vignette: >
  %\VignetteIndexEntry{enriching glm objects}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction

The [**enrichwith**](https://github.com/ikosmidis/enrichwith) R
package provides the `enrich` method to enrich list-like R objects
with new, relevant components. The resulting objects preserve their
class, so all methods associated with them still apply.

This vignette is a demo of the available enrichment options for `glm`
objects.

# Clotting data set

The following data set is provided in @mccullagh:89 [Section 8.4] and
consists of observations on $n = 18$ mean clotting times of blood in
seconds (`time`) for each combination of nine percentage
concentrations of normal plasma (`conc`) and two lots of clotting
agent (`lot`).
```{r, echo = TRUE, eval = TRUE}
clotting <- data.frame(conc = c(5,10,15,20,30,40,60,80,100,
                                5,10,15,20,30,40,60,80,100),
                       time = c(118, 58, 42, 35, 27, 25, 21, 19, 18,
                                69, 35, 26, 21, 18, 16, 13, 12, 12),
                       lot = factor(c(rep(1, 9), rep(2, 9))))
```

@mccullagh:89 [Section 8.4] fitted a series of nested generalized
linear models assuming that the times are realizations of independent
[Gamma](https://en.wikipedia.org/wiki/Gamma_distribution) random
variables whose mean varies appropriately with concentration and
lot. In particular, @mccullagh:89 linked the inverse mean of the gamma
random variables to the linear predictors `~ 1`, `~ log(conc)`, `~
log(conc) + lot`, `~ log(conc) * lot` and carried out an analysis of
deviance to conclude that the `~ log(u) * lot` provides the best
explanation of clotting times. The really close fit of that model to
the data can be seen in the figure below.
```{r, echo = TRUE, eval = TRUE}
library("ggplot2")
clottingML <- glm(time ~ log(conc) * lot, family = Gamma, data = clotting)
alpha <- 0.01
pr_out <- predict(clottingML, type = "response", se.fit = TRUE)
new_data <- clotting
new_data$time <- pr_out$fit
new_data$type <- "fitted"
clotting$type <- "observed"
all_data <- rbind(clotting, new_data)
new_data <- within(new_data, {
    low <- pr_out$fit - qnorm(1 - alpha/2) * pr_out$se.fit
    upp <- pr_out$fit + qnorm(1 - alpha/2) * pr_out$se.fit
}) |> merge(all_data, all = TRUE)
ggplot(all_data) +
    geom_point(aes(conc, time, col = type), alpha = 0.8) +
    geom_segment(data = new_data, aes(x = conc, y = low, xend = conc, yend = upp, col = type)) +
    facet_grid(. ~ lot) +
    theme_bw() +
    theme(legend.position = "top")
```

# Key quantities in likelihood inference
## Score function

The [score function](`r URLencode("https://en.wikipedia.org/wiki/Score_(statistics)")`)
is the gradient of the log-likelihood and is a key object for
likelihood inference.

The **enrichwith** R package provides methods for the enrichment of
`glm` objects with the corresponding score function. This can either
be done by enriching the `glm` object with `auxiliary_functions` and
then extracting the score function
```{r, echo = TRUE, eval = TRUE}
library("enrichwith")
enriched_clottingML <- enrich(clottingML, with = "auxiliary functions")
scores_clottingML <- enriched_clottingML$auxiliary_functions$score
```
or directly using the `get_score_function` convenience method
```{r, echo = TRUE, eval = TRUE}
scores_clottingML <- get_score_function(clottingML)
```

By definition, the score function has to have zero components when evaluated at the
maximum likelihood estimates (only numerically zero here).
```{r, echo = TRUE, eval = TRUE}
scores_clottingML()
```

## Information matrix

Another key quantity in likelihood inference is the [expected
information](https://en.wikipedia.org/wiki/Fisher_information).


The `auxiliary_functions` enrichment option of the `enrich` method
enriches a `glm` object with a function for the evaluation of the
expected information.
```{r, echo = TRUE, eval = TRUE}
info_clottingML <- enriched_clottingML$auxiliary_functions$information
```
and can also be computed directly using the `get_information_function` method
```{r, echo = TRUE, eval = TRUE}
info_clottingML <- get_information_function(clottingML)
```

One of the uses of the expected information is the calculation of
standard errors for the model parameters. The `stats::summary.glm`
function already does that for `glm` objects, estimating the
dispersion parameter (if any) using Pearson residuals.
```{r, echo = TRUE, eval = TRUE}
summary_clottingML <- summary(clottingML)
```

Duly, `info_clottingML` returns (numerically) the same standard errors
as the `summary` method does.
```{r, echo = TRUE, eval = TRUE}
summary_std_errors <- coef(summary_clottingML)[, "Std. Error"]
einfo <- info_clottingML(dispersion = summary_clottingML$dispersion)
all.equal(sqrt(diag(solve(einfo)))[1:4], summary_std_errors, tolerance = 1e-05)
```

Another estimate of the standard errors results by the [observed
information](https://en.wikipedia.org/wiki/Observed_information),
which is the negative [Hessian
matrix](https://en.wikipedia.org/wiki/Hessian_matrix) of the
log-likelihood.

At least at the time of writing the current vignette, there appears
to be no general implementation of the observed information for `glm`
objects. I guess the reason for that is the dependence of the observed
information on higher-order derivatives of the inverse link and
variance functions, which are not readily available in base R.

**enrichwith** provides options for the enrichment of `link-glm` and
`family` objects with such derivatives (see `?enrich.link-glm` and
`?enrich.family` for details), and, based on those allows the
enrichment of `glm` objects with a function to compute the observed
information.

The observed and the expected information for the regression
parameters coincide for GLMs with canonical link, like `clottingML`
```{r, echo = TRUE, eval = TRUE}
oinfo <- info_clottingML(dispersion = summary_clottingML$dispersion, type = "observed")
all.equal(oinfo[1:4, 1:4], einfo[1:4, 1:4])
```
which is not generally true for a `glm` with a non-canonical link, as
seen below.
```{r, echo = TRUE, eval = TRUE}
clottingML_log <- update(clottingML, family = Gamma("log"))
summary_clottingML_log <- summary(clottingML_log)
info_clottingML_log <- get_information_function(clottingML_log)
einfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "expected")
oinfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "observed")
round(einfo_log, 3)
round(oinfo_log, 3)
```

## Score tests

We can now use `scores_clottingML` and `info_clottingML` to carry out
[a score test](https://en.wikipedia.org/wiki/Score_test) to compare
nested models.

If $l_\psi(\psi, \lambda)$ is the gradient of the log-likelihood with
respect to $\psi$ evaluated at $\psi$ and $\lambda$, $i^{\psi\psi}(\psi,
\lambda)$ is the $(\psi, \psi)$ block of the inverse of the expected
information matrix, and $\hat\lambda_\psi$ is the maximum likelihood
estimator of $\lambda$ for fixed $\psi$, then, assuming that the model
is adequate
\[
l_\psi(\psi, \hat\lambda_\psi)^\top i^{\psi\psi}(\psi,\hat\lambda_\psi) l_\psi(\psi, \hat\lambda_\psi)
\]
has an asymptotic $\chi^2_{dim(\psi)}$ distribution.

The following code chunk computes the maximum likelihood estimates
when the parameters for `lot` and `log(conc):lot` are fixed to zero
($\hat\lambda\psi$ for $\psi = 0$ above), along with the score and
expected information matrix evaluated at them
($l(\psi,\hat\lambda_\psi) and $i(\psi,\hat\lambda_\psi)$, above).
```{r, echo = TRUE, eval = TRUE}
clottingML_nested <- update(clottingML, . ~ log(conc))
enriched_clottingML_nested <- enrich(clottingML_nested, with = "mle of dispersion")
coef_full <- coef(clottingML)
coef_hypothesis <- structure(rep(0, length(coef_full)), names = names(coef_full))
coef_hypothesis_short <- coef(enriched_clottingML_nested, model = "mean")
coef_hypothesis[names(coef_hypothesis_short)] <- coef_hypothesis_short
disp_hypothesis <- coef(enriched_clottingML_nested, model = "dispersion")
scores <- scores_clottingML(coef_hypothesis, disp_hypothesis)
info <- info_clottingML(coef_hypothesis, disp_hypothesis)
```
The object `enriched_clottingML_nested` inherits from `enriched_glm`,
`glm` and `lm`, and, as illustrated above, **enrichwith** provides a
corresponding `coef` method to extract the estimates for the mean
regression parameters and/or the estimate for the dispersion
parameter.

The score statistic is then
```{r, echo = TRUE, eval = TRUE}
(score_statistic <- drop(scores%*%solve(info)%*%scores))
```
which gives a p-value of
```{r, echo = TRUE, eval = TRUE}
pchisq(score_statistic, 2, lower.tail = FALSE)
```

For comparison, the Wald statistic for the same hypothesis is
```{r, echo = TRUE, eval = TRUE}
coef_full[3:4]%*%solve(solve(info)[3:4, 3:4])%*%coef_full[3:4]
```
and the log-likelihood ratio statistic is
```{r, echo = TRUE, eval = TRUE}
(deviance(clottingML_nested) - deviance(clottingML))/disp_hypothesis
```
which is close to the score statistic


## Simulating from `glm` objects at parameter values

**enrichwith** also provides the `get_simulate_function` method for
`glm` or `lm` objects. The `get_simulate_function` computes a function
to simulate response vectors at /arbitrary/ values of the model
parameters, which can be useful when setting up simulation experiments
and for various inferential procedures (e.g. indirect inference).

For example, the following code chunk simulates three response vectors
at the maximum likelihood estimates of the parameters for `clottingML`.
```{r, echo = TRUE, eval = TRUE}
simulate_clottingML <- get_simulate_function(clottingML)
simulate_clottingML(nsim = 3, seed = 123)
```
The result is the same to what the `simulate` method returns
```{r, echo = TRUE, eval = TRUE}
simulate(clottingML, nsim = 3, seed = 123)
```
but `simulate_clottingML` can also be used to simulate at any given
parameter value
```{r, echo = TRUE, eval = TRUE}
coefficients <- c(0, 0.01, 0, 0.01)
dispersion <- 0.001
samples <- simulate_clottingML(coefficients = coefficients, dispersion = dispersion, nsim = 500000, seed = 123)
```
The empirical means and variances based on `samples` agree with the   exact means and variances at `coefficients` and `dispersion`
```{r, echo = TRUE, eval = TRUE}
means <- 1/(model.matrix(clottingML) %*% coefficients)
variances <- dispersion * means^2
max(abs(rowMeans(samples) - means))
max(abs(apply(samples, 1, var) - variances))
```

# pmodel, dmodel, qmodel
**enrichwith** also provides the `get_dmodel_function`, `get_pmodel_function` and `get_qmodel_function` methods, which can be used to evaluate densities or probability mass functions, distribution functions, and quantile functions, respectively at arbitrary parameter values and data settings.

For example, the following code chunk computes the density at the observations in the `clotting` data set under then maximum likelihood fit
```{r, echo = TRUE, eval = TRUE}
cML_dmodel <- get_dmodel_function(clottingML)
cML_dmodel()
```
We can also compute the densities at user-supplied parameter values
```{r, echo = TRUE, eval = TRUE}
cML_dmodel(coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
```
or even at different data points
```{r, echo = TRUE, eval = TRUE}
new_data <- data.frame(conc = 5:10, time = 50:45, lot = factor(c(1, 1, 1, 2, 2, 2)))
cML_dmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
```
We can also compute cumulative probabilities and quantile functions. So
```{r, echo = TRUE, eval = TRUE}
cML_qmodel <- get_qmodel_function(clottingML)
cML_pmodel <- get_pmodel_function(clottingML)
probs <- cML_pmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
cML_qmodel(probs, data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
```
returns `new_data$time`.

For example the fitted densities when `(conc, lot)` is `c(15, 1)`, `c(15, 2)`, `c(40, 1)` or `c(40, 2)` are
```{r, echo = TRUE, eval = TRUE}
new_data <- expand.grid(conc = c(15, 40),
                        lot = factor(1:2),
                        time = seq(0, 50, length = 500))
new_data$density <- cML_dmodel(new_data)
ggplot(data = new_data) + 
     geom_line(aes(time, density)) + facet_grid(conc ~ lot) +
     theme_bw() +
     geom_vline(data = clotting[c(3, 6, 12, 15), ], aes(xintercept = time), lty = 2)
```

The dashed vertical lines are the observed values of time for the `(conc, lot)` combinations in the figure.

# `enriched_glm`
**enrichwith** provides a wrapper to the `glm` interface that results directly in `enriched_glm` objects that carry all the components described above. For example,
```{r, echo = TRUE, eval = TRUE}
enriched_clottingML <- enriched_glm(time ~ log(conc) * lot, family = Gamma, data = clotting)
names(enriched_clottingML$auxiliary_functions)
enriched_clottingML$score_mle
enriched_clottingML$expected_information_mle
enriched_clottingML$observed_information_mle
enriched_clottingML$bias_mle
enriched_clottingML$dispersion_mle
```

# Links to other resources

The vignette
[bias](https://CRAN.R-project.org/package=enrichwith/vignettes/bias.html)
of **enrichwith** describes how to use the `auxiliary_functions` to
reduce the bias of the maximum likelihood estimator for generalized
linear models.

# References