---
title: "Estimating Ordinal Regression Models with rstanarm"
author: "Jonah Gabry and Ben Goodrich"
date: "`r Sys.Date()`"
output: 
  html_vignette: 
    toc: yes
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{stan_polr: Ordinal Models}
-->
```{r, child="children/SETTINGS-knitr.txt"}
```
```{r, child="children/SETTINGS-gg.txt"}
```

# Introduction

This vignette explains how to estimate models for ordinal outcomes using the
`stan_polr` function in the __rstanarm__ package.

```{r, child="children/four_steps.txt"}
```

Steps 3 and 4 are covered in more depth by the vignette entitled ["How to Use the
__rstanarm__ Package"](rstanarm.html). This vignette focuses on Step 1.

One of the strengths of doing MCMC with Stan --- as opposed to a Gibbs sampler 
--- is that reparameterizations are essentially costless, which allows the user
to specify priors on parameters that are either more intuitive, numerically
stable, or computationally efficient without changing the posterior 
distribution of the parameters that enter the likelihood. Advantageous 
parameterizations are already built into the Stan programs used in the 
__rstanarm__ package, so it is just a matter of using these vignettes to 
explain how the priors work in the context of these reparameterizations.

# Likelihood

Ordinal outcomes fall in one of $J$ categories. One way to motivate an ordinal
model is to introduce a latent variable, $y^\ast$, that is related to the 
observed outcomes via an observation mechanism:
$$y=\begin{cases}
1 & \mbox{if }y^{\ast}<\zeta_{1}\\
2 & \mbox{if }\zeta_{1}\leq y^{\ast}<\zeta_{2}\\
\vdots\\
J & \mbox{if }\zeta_{J-1}\leq y^{\ast}
\end{cases},$$
where $\boldsymbol{\zeta}$ is a vector of cutpoints of length $J-1$.

Then $y^\ast$ is modeled as a linear function of $K$ predictors 
$$y^\ast = \mu + \epsilon = \mathbf{x}^\top \boldsymbol{\beta} + \epsilon,$$
where $\epsilon$ has mean zero and unit scale but can be specified as being
drawn from one of several distributions. Note that there is no "intercept"
in this model since the data cannot distinguish an intercept from the 
cutpoints. However, if $J = 2$, then $\zeta_1$ can be referred to as either
the cutpoint or the intercept.

A Bayesian can treat $y^\ast$ as another unknown parameter, although for
computational efficiency the Stan code essentially integrates each $y^\ast$
out of the posterior distribution, leaving the posterior distribution of 
$\boldsymbol{\beta}$ and $\boldsymbol{\zeta}$. Nevertheless, it is useful to 
motivate the model theoretically as if $y^\ast$ were just an unknown 
parameter with a distribution truncated by the relevant element(s) of
$\boldsymbol{\zeta}$.

# Priors

If $y^\ast$ were observed we would simply have a linear regression model
for it, and the description of the priors in the vignette entitled ["Estimating
Linear Models with the __rstanarm__ Package"](lm.html) would apply directly. Another way
to say the same thing is _conditional_ on a realization of $y^\ast$, we have a
linear regression model and the description of the priors in the other [vignette](lm.html)
does apply (and should be read before continuing with this subsection).

The `stan_lm` function essentially specifies a prior on 
$\boldsymbol{\theta} = \mathbf{R}^{-1} \boldsymbol{\beta}$, where $\mathbf{R}$
is the upper triangular matrix in the QR decomposition of the design matrix,
$\mathbf{X} = \mathbf{Q} \mathbf{R}$. Furthermore, in `stan_lm`, 
$\sigma_{\epsilon} = \sigma_y \sqrt{1 - R^2}$ where $R^2$ is the proportion of
variance in the outcome that is attributable to the coefficients in a linear 
model.

The main difference in the context of a model for an ordinal outcome is that
the scale of $y^\ast$ is not identified by the data. Thus, the ordinal model
specifies that $\sigma_{\epsilon} = 1$, which implies that 
$\sigma_{y^\ast} = 1 / \sqrt{1 - R^2}$ is an intermediate parameter rather than
a primitive parameter.

It is somewhat more difficult to specify a prior value for the $R^2$ in an
ordinal model because $R^2$ refers to the proportion of variance in the
\emph{unobservable} $y^\ast$ that is attributable to the predictors under a
linear model. In general, the $R^2$ tends to be lower in an ordinal model than
in a linear model where the continuous outcome is observed.

The other difference is that an ordinal model does not have a global intercept 
but rather a vector of $J-1$ cutpoints. The implied prior on these cutpoints 
used by the __rstanarm__ package is somewhat novel. The user instead specifies a
Dirichlet prior on 
$\Pr\left(y=j \, \left.\right| \, \overline{\mathbf{x}} \right)$,
which is to say the prior probability of the outcome falling in each of the $J$
categories given that the predictors are at their sample means. The Dirichlet
prior is for a simplex random variable, whose elements are non-negative and sum
to $1$. The Dirichlet PDF can be written as 
$$f\left(\boldsymbol{\pi}|\boldsymbol{\alpha}\right) \propto 
\prod_{j=1}^J{\pi_j^{\alpha_j - 1}}, $$ 
where $\boldsymbol{\pi}$ is a simplex vector such that 
$\pi_j = \Pr\left(y=j \, \left.\right| \, \overline{\mathbf{x}} \right)$.

The Dirichlet prior is one of the easiest to specify because the so-called 
"concentration" hyperparameters $\boldsymbol{\alpha}$ can be interpreted as
prior counts, i.e., prior observations for each of the J categories (although
they need not be integers). If $\alpha_j = 1$ for every $j$ (the default used by
__rstanarm__) then the Dirichlet prior is jointly uniform over the space of
these simplexes. This corresponds to a prior count of one observation falling in
each of the $J$ ordinal categories when the predictors are at their sample means
and conveys the reasonable but weak prior information that no category has
probability zero. If, for each $j$, $\alpha_j = \alpha > 1$ then the prior mode
is that the $J$ categories are equiprobable, with prior probability $1/J$ of the
outcome falling in each of the $J$ categories. The larger the value of $\alpha$
the more sharply peaked the distribution is at the mode.

The $j$-th cutpoint $\zeta_j$ is then given by 
$$\zeta_j = F_{y^\ast}^{-1}\left(\sum_{i=1}^j{\pi_i}\right),$$ 
where  $F_{y^\ast}^{-1}$ is 
an inverse CDF function, which depends on the assumed distribution of $y^\ast$.
Common choices include the normal and logistic distributions. The scale
parameter of this distribution is again 
$\sigma_{y^\ast} = 1/\sqrt{1 - R^2}$. 
In short, by making each $\zeta_j$ a function of $\boldsymbol{\pi}$, it
allows us to specify a Dirichlet prior on $\boldsymbol{\pi}$, which is simpler
than specifying a prior on $\boldsymbol{\zeta}$ directly.

# Example

In this section, we start with an ordinal model of tobacco consumption as
a function of age and alcohol consumption. Frequentist estimates can be
obtained using the `polr` function in the __MASS__ package:
```{r polr-tobgp-mass}
library(MASS)
print(polr(tobgp ~ agegp + alcgp, data = esoph), digits = 1)
```
To obtain Bayesian estimates, we prepend `stan_` and specify the priors:
```{r polr-tobgp-mcmc, results="hide"}
library(rstanarm)
post0 <- stan_polr(tobgp ~ agegp + alcgp, data = esoph, 
                   prior = R2(0.25), prior_counts = dirichlet(1),
                   seed = 12345)
```
```{r}
print(post0, digits = 1)
```

```{r, polr-tobgp-cutpoints, echo=FALSE}
zeta_medians <- round(apply(rstan::extract(post0$stanfit, pars = "zeta")[[1]], 
                            2, median), digits = 2)
```
The point estimates, represented by the posterior medians, are qualitatively
similar to the maximum-likelihood estimates but are somewhat shrunk toward
zero due to the regularizing prior on the coefficients. Since these 
cutpoints are actually _known_, it would be more appropriate for the model to 
take that into account, but `stan_polr` does not currently support that.

Next, we utilize an example from the __MASS__ package where low birthweight is 
the binary outcome of interest. First, we recode some of the variables:
```{r polr-birthwt-recodes}
data("birthwt", package = "MASS")
birthwt$race <- factor(birthwt$race, levels = 1:3, 
                       labels = c("white", "black", "other"))
birthwt$bwt <- birthwt$bwt / 1000 # convert from grams to kilograms
birthwt$low <- factor(birthwt$low, levels = 0:1, labels = c("no", "yes"))
```
It is usually a good idea to rescale variables by constants so that all the
numbers are in single or double digits. We start by estimating a linear model
for birthweight in kilograms, flipping the sign so that positive coefficients
are associated with _lower_ birthweights.
```{r polr-stan_lm, results="hide"}
post1 <- stan_lm(-bwt ~ smoke + age + race + ptl + ht + ftv,
                 data = birthwt, prior = R2(0.5), 
                 seed = 12345)
```
```{r}
print(post1)
```

Next, we estimate an "ordinal" model for the incidence of low birthweight, which
is defined as a birth weight of less than $2.5$ kilograms. Even though this 
outcome is binary, a binary variable is a special case of an ordinal variable
with $J=2$ categories and is acceptable to `stan_polr`. We can think of `bwt`
as something proportional to $y^\ast$ and pretend that it is not observed, 
forcing us to estimate an ordinal model.
```{r polr-birthwt-mcmc, results="hide"}
post2 <- stan_polr(low ~ smoke + age + race + ptl + ht + ftv, data = birthwt,
                   prior = R2(0.5), prior_counts = dirichlet(c(1,1)), 
                   method = "probit", seed = 12345)
```
```{r, polr-loo-plot}
plot(loo(post2))
```

This prior seems to have worked well in this case because none of the points
in the plot are above $0.5$, which would have indicated the the posterior is very
sensitive to those observations. If we compare the estimated coefficients,
```{r polr-birthwt-comparison}
round(cbind(Linear = coef(post1), Ordinal = coef(post2), 
            Rescaled = coef(post1) / sigma(post1)), 3)

```
they have the same signs and similar magnitudes, with the exception of the 
"Intercept". In an ordinal model where the outcome only has $J=2$ categories, 
this "Intercept" is actually $\zeta_1$, but it is more conventional to call it
the "Intercept" so that it agrees with `stan_glm` when 
`family = binomial(link = 'probit')`. 
Recall that $\sigma_{\epsilon} = 1$ in an ordinal model, so if we
rescale the coefficients from a linear model by dividing by the posterior median
of $\sigma$, the resulting coefficients are even closer to those of the ordinal
model.

This illustrates the fundamental similarity between a linear model for a 
continuous observed outcome and a linear model for a latent $y^\ast$ that 
generates an ordinal observed outcome. The main difference is when the outcome
is continuous and observed, we can estimate the scale of the errors 
meaningfully. When the outcome is ordinal, we can only fix the scale of the
latent errors to $1$ arbitrarily.

Finally, when $J = 2$, the `stan_polr` function allows you to specify non-`NULL`
values of the `shape` and `rate` arguments, which implies a "scobit" likelihood
where the probability of success is given by $F\left(y^\ast \right)^\alpha$, 
where $F\left(\right)$ is the logistic CDF and $\alpha > 0$ is a skewing 
parameter that has a gamma prior with a given `shape` and `rate`. If 
$\alpha \neq 1$, then the relationship between $y^\ast$ and the probability of 
success is asymmetric. In principle, it seems appropriate to estimate $\alpha$ 
but in practice, a lot of data is needed to estimate $\alpha$ with adequate 
precision. In the previous example, if we specify `shape = 2` and `rate = 2` to 
reflect the prior beliefs that $\alpha$ is expected to be $1$ but has a variance 
of $\frac{1}{2}$, then the `loo` calculation yields many Pareto shape parameters 
that are excessively large. However, with more than $189$ observations, such a 
model may be more fruitful.

# Conclusion

The posterior distribution for an ordinal model requires priors on the 
coefficients and the cutpoints. The priors used by the `stan_polr` function are
unconventional but should work well for a variety of problems. The prior on the
coefficients is essentially the same as that used by the `stan_lm` function but
omits a scale parameter because the standard deviation of the latent $y^\ast$ is
not identified by the data. The cutpoints are conditionally deterministic given
a simplex vector for the probability of falling in each of the $J$ ordinal
categories given that the predictors are at their sample means. Thus, a
Dirichlet prior --- which is relatively easy to specify and has a good default
of jointly uniform --- on this simplex completes the posterior distribution.

This approach provides an alternative to `stan_glm` with `family = binomial()`
even if the outcome variable has only two categories. The `stan_glm` function
has more options for the prior on the coefficients and the prior on the 
intercept (which can be interpreted as the first cutpoint when $J = 2$).
However, it may be more difficult to obtain efficient sampling with those
priors.