---
title: Binomial Regression with ciTools
author: Matthew Avery
date: 11 October 2017
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Binomial Regression with ciTools}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

## Logistic regression with binomial data

Logistic regression is most commonly used with a Bernoulli
response. If $Y \sim Bernoulli(p)$, then $P(Y = 1) = p = 1 - P(Y =
0)$. Logistic regression can be used to estimate $p|\textbf{x}$
where $\textbf{x}$ is a collection of predictor variables. This is
accomplished by assuming the relationship

$$log(\frac{p}{1-p}) = \textbf{x}\beta$$

and then choosing $\beta$ to maximize the joint likelihood:

$$\Pi_i p_i^{y_i}(1-p_i)^{(1-y_i)}$$

Note that the function linking the predictor variables to $p$ is
called the logit function, which is what gives logistic regression
its name. In $\textbf{R}$, this model can be fit using `glm` by
specifying the options `family = "binomial"`.

The binomial distribution is a generalization of the Bernoulli,
typically conceptualized as describing the number of "successes"
(here denoted $y$) out of some number of attempts or trials (here
denoted $n$). The probability mass function for a binomial random
variable, $Y$, is

 $$P(Y = y|n, p) = {x \choose n} p^x(1 - p)^{n - x}$$

From this, we can see that by setting $n = 1$, we recover the Bernoulli PMF.

Since the binomial distribution is a generalization of the
Bernoulli distribution, it stands to reason that it may be possible
to generalize logistic regression for Bernoulli variables to
binomial variables. Indeed, this is the case, and it turns out it
is relatively easy. Using the logit link function again, maximize
the joint likelihood:

 $$\Pi_i {y_i \choose n_i} p_i^{y_i}(1-p_i)^{(1-y_i)}$$

## Binomial regression in R

R is also capable of fitting binomial regression models using `glm`
with `family = "binomial"`. However, there are a few syntactic
quirks. First, in the model statement, the response variable must
be given as a proportion of successes for a given run. This can be
done either by providing the proportions as a column in the data
matrix or by specifying a ratio of the number of successes to the
number of attempts. An example here might be instructive.

```{r message = F}
set.seed(20171011)
library(dplyr)
library(ciTools)
```

```{r}
df <- data.frame(
  x = runif(30, -1, 1),
  n = rbinom(30, 6, .8)) %>%
  mutate(y = rbinom(30, n, (exp(x)/(1 + exp(x)))))
```

Consider the data set, `df`, given below.

```{r }
head(df)
```

Here, $x$ is some predictor variable related to the number of
successes, $y$, for the given number of attempts, $n$. We can fit a
logistic regression to this data to estimate this relationship:

```{r }
glm(y/n ~ x, data = df, family = "binomial", weights = n)
```

We get the same results using different syntax:

```{r }
dfProb <- mutate(df, prob = y/n)
head(dfProb)
glm(prob ~ x, data = dfProb, family = "binomial", weights = n)
```

Notably, these fitted values are equivalent to what we would get if
we were to transform each of our binomial responses into equivalent
sets of Bernoulli trials. For example, the first observation in
`df` is a single success out of 5 attempts with a covariate of
`r round(df$x[1], 3)`.  Another way to think of this is five Bernoulli
trials, one successful, four unsuccessful, all with a covariate
value of `r round(df$x[1], 3)`.

```{r echo = F}
get_box <- function(myRow){
  data.frame(x = rep(myRow$x, myRow$n),
             y = c(rep(1, myRow$y), rep(0, myRow$n - myRow$y)))
}

out <- NULL
for(i in 1:nrow(df)){
  out <- bind_rows(out, get_box(df[i,]))
}

dfTall <- out
```

For example:

```{r }
head(dfTall)
glm(y ~ x, data = dfTall, family = "binomial")
```

Note that while the degrees of freedom, information criteria,
etc. are different, the coefficient estimates are the same, as are
the standard errors for these coefficients.

##Using `ciTools` for binomial regression

`ciTools` supports logistic regression with both Bernoulli and
binomial response variables. For both types, `add_ci` has intuitive
and expected functionality. It produces a point estimate and
confidence intervals around the estimated probability of
success. However, the other functions of `ciTools` (`add_pi`,
`add_quantile`, and `add_probs`) produce different behaviors,
because prediction intervals, quantiles, and probability
values are not very useful for Bernoulli variables.

For example, consider prediction intervals. These are used to
quantify the observation-to-observation variability and estimate its
range. For a Bernoulli variable, only two values are possible, so
this is not particularly meaningful -- the prediction interval will
always be $[0,1]$. Quantiles are similarly problematic. And
probability estimates (that is, $P(Y>y|x)$) are
equivalent to estimates of $p$ or $1-p$ . Thus, asking for any of
these will produce either an error (in the case of
`add_pi.glm(family =
"binomial)` or `add_quantiles(family = "binomial)`) or a warning
(in the case of `add_probs.glm(family = "binomial")`).

For a binomial response, on the other hand, each of these functions
produce meaningful output. Binomial response variables can take
more than two values, making it sensible to consider prediction
intervals, quantiles, or probabilities. For example, it may be interesting
to consider the 90 percent quantile for a binomial regression:

```{r echo = F}
fit <- glm(y/n ~ x, data = df, family = "binomial", weights = n)
```



```{r }
head(add_quantile(df, fit, p = 0.9))
```

There is a lot to unpack here, including three warning
messages. Let's consider the output first. Unlike when we use
`add_ci.glm`, the `pred` column includes the predicted values
$E(Y|\boldsymbol{x})$ rather than the estimated probability of success
$\hat{p}|\boldsymbol{x}$.

Next, there are three warnings. None of these indicate that
anything has gone wrong. Rather, they are included to ensure that
users understand the output they're given. The first warning makes
it clear to the user that the `weights` argument is assumed to
indicate that they're doing binomial rather than weighted Bernoulli
regression. The second warning informs users that the estimated
interval is approximate. This is because the current method used by
`ciTools` for GLMs is based on simulation, and because `ciTools`
forces quantile estimates to lie in the support set of the response
distribution. The final warning points out that the `pred` column
refers to the fitted values rather than estimated probability of
success, which was mentioned in the previous paragraph.

## Summary

Logistic regression can be used for both Bernoulli and Binomial
response variables, and `glm` supports both. `ciTools`
differentiates between the two based on whether the user has
included a column in the `weights` argument. Aside from `add_ci`,
none of the functions in `ciTools` produce useful information for
Bernoulli response variables. For Binomial response variables, all
of these functions produce useful information, though error
messages are included to ensure that users understand the output
presented.
