---
title: "Tutorial: Test for etiologic heterogeneity in a case-control study"
author: "Emily C. Zabor"
date: "Last updated: `r format(Sys.Date(), '%B %d, %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tutorial: Test for etiologic heterogeneity in a case-control study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(riskclustr)
```


## Introduction

In epidemiologic studies polytomous logistic regression is commonly used in the study of etiologic heterogeneity when data are from a case-control study, and the method has good statistical properties. Although polytomous logistic regression can be implemented using available software, the additional calculations needed to perform a thorough analysis of etiologic heterogeneity are cumbersome. To facilitate use of this method we provide functions `eh_test_subtype()` and `eh_test_marker()` to address two key questions regarding etiologic heterogeneity:

1. Do risk factor effects differ according to disease subtypes?
2. Do risk factor effects differ according to individual disease markers that combine to form disease subtypes?


Whether disease subtypes are pre-specified or formed by cross-classification of individual disease markers, the resulting polytomous logistic regression model is the same. Let $i$ index study subjects, $i = 1, \ldots, N$, let $m$ index disease subtypes, $m = 0, \ldots M$, where $m=0$ denotes control subjects, and let $p$ index risk factors, $p = 1, \ldots, P$. The polytomous logistic regression model is specified as

$$\Pr(Y = m | \mathbf{X}) = \frac{\exp(\mathbf{X}^T \boldsymbol{\beta}_{\boldsymbol{\cdot} m})}{\mathbf{1} + \exp(\mathbf{X}^T \boldsymbol{\beta}) \mathbf{1}}$$
where $\mathbf{X}$ is the $(P+1) \times N$ matrix of risk factor values, with the first row all ones for the intercept, and $\boldsymbol{\beta}$ is the $(P+1) \times M$ matrix of regression coefficients. $\boldsymbol{\beta}_{\boldsymbol{\cdot} m}$ indicates the $m$th column of the matrix $\boldsymbol{\beta}$ and $\mathbf{1}$ represents a vector of ones of length $M$.


## Pre-specified subtypes {#subtype}

If disease subtypes are pre-specified, either based on clustering high-dimensional disease marker data or based on a single disease marker or combinations of disease markers, then statistical tests for etiologic heterogeneity according to each risk factor can be conducted using the `eh_test_subtype()` function. 

Estimates of the parameters of interest related to the question of whether risk factor effects differ across subtypes of disease, $\hat{\boldsymbol{\beta}}$, and the associated estimated variance-covariance matrix, $\widehat{cov}(\hat{\boldsymbol{\beta}})$, are obtained directly from the resulting polytomous logistic regression model. Each $\beta_{pm}$ parameter represents the log odds ratio for a one-unit change in risk factor $p$ for subtype $m$ disease versus controls. Hypothesis tests for the question of whether a specific risk factor effect differs across subtypes of disease can be conducted separately for each risk factor $p$ using a Wald test of the hypothesis 

$$H_{0_{\beta_{p.}}}: \beta_{p1} = \dots = \beta_{pM}$$ 
Using the `subtype_data` simulated dataset, we can examine the influence of risk factors `x1`, `x2`, and `x3` on the 4 pre-specified disease subtypes in variable `subtype` using the following code:

```{r}
library(riskclustr)

mod1 <- eh_test_subtype(
  label = "subtype", 
  M = 4, 
  factors = list("x1", "x2", "x3"), 
  data = subtype_data)
```

See the function [documentation](https://www.emilyzabor.com/riskclustr/reference/eh_test_subtype.html) for details of function arguments.  

The resulting estimates $\hat{\boldsymbol{\beta}}$ can be accessed with

```{r eval = FALSE}
mod1$beta
```


```{r echo = FALSE}
knitr::kable(mod1$beta)
```

the associated standard deviation estimates $\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}$ with

```{r eval = FALSE}
mod1$beta_se
```


```{r echo = FALSE}
knitr::kable(mod1$beta_se)
```

and the heterogeneity p-values with 

```{r eval = FALSE}
mod1$eh_pval
```


```{r echo = FALSE}
knitr::kable(mod1$eh_pval)
```

An overall formatted dataframe containing $\hat{\boldsymbol{\beta}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\Big)$ and heterogeneity p-values `p_het` to test the null hypotheses $H_{0_{\beta_{p.}}}$ can be obtained as

```{r eval = FALSE}
mod1$beta_se_p
```


```{r echo = FALSE}
knitr::kable(mod1$beta_se_p)
```

Because it is often of interest to examine associations in a case-control study on the odds ratio (OR) scale rather than the original parameter estimate scale, it is also possible to obtain a matrix containing $OR=\exp(\hat{\boldsymbol{\beta}})$, along with 95\% confidence intervals and heterogeneity p-values `p_het` to test the null hypotheses $H_{0_{\beta_{p.}}}$ using

```{r eval = FALSE}
mod1$or_ci_p
```


```{r echo = FALSE}
knitr::kable(mod1$or_ci_p)
```


## Subtypes formed by cross-classification of disease markers

If disease subtypes are formed by cross-classifying individual binary disease markers, then statistical tests for associations between risk factors and individual disease markers can be conducted using the `eh_test_marker()` funtion.

Let $k$ index disease markers, $k = 1, \ldots, K$. Here the $M$ disease subtypes are formed by cross-classification of the $K$ binary disease markers, so that we have $M = 2^K$ disease subtypes.

To evaluate the independent influences of individual disease markers, it is convenient to transform the parameters in $\boldsymbol{\beta}$ using the one-to-one linear transformation 

$$\hat{\boldsymbol{\gamma}} = \frac{\hat{\boldsymbol{\beta}} \mathbf{L}}{M/2}.$$
Here $\mathbf{L}$ is an $M \times K$ contrast matrix such that the entries are -1 if disease marker $k$ is absent for disease subtype $m$ and 1 if disease marker $k$ is present for disease subtype $m$. $\boldsymbol{\gamma}$ is then the $(P+1) \times K$ matrix of parameters that reflect the independent effects of distinct disease markers. Each element of the $\boldsymbol{\gamma}$ parameters represents the average of differences in log odds ratios between disease subtypes defined by different levels of the $k$th disease marker with respect to the $p$th risk factor when the other disease markers are held constant. Variance estimates corresponding to each $\hat{\gamma}_{pk}$ are obtained using

$$\widehat{var}(\hat{\gamma}_{pk}) = \left(\frac{M}{2}\right)^{-2} \mathbf{L}_{\boldsymbol{\cdot} k}^T \widehat{cov}(\hat{\boldsymbol{\beta}}_{p \boldsymbol{\cdot}}^T) \mathbf{L}_{\boldsymbol{\cdot} k}$$
where $\mathbf{L}_{\boldsymbol{\cdot} k}$ is the $k$th column of the $\mathbf{L}$ matrix and the estimated variance-covariance matrix $\widehat{cov}(\hat{\boldsymbol{\beta}}_{p \boldsymbol{\cdot}})$ for each risk factor $p$ is obtained directly from the polytomous logistic regression model. 

Hypothesis tests for the question of whether a risk factor effect differs across levels of each individual disease marker of which the disease subtypes are comprised can be conducted separately for each combination of risk factor $p$ and disease marker $k$ using a Wald test of the hypothesis 

$$H_{0_{{\gamma_{pk}}}}: \gamma_{pk} = 0.$$
Using the `subtype_data` simulated dataset, we can examine the influence of risk factors `x1`, `x2`, and `x3` on the two individual disease markers `marker1` and `marker2`. These two binary disease markers will be cross-classified to form four disease subtypes that will be used as the outcome in the polytomous logistic regression model to obtain the $\hat{\boldsymbol{\beta}}$ estimates, which are then transformed in order to obtain estimates and hypothesis tests related to the individual disease markers.

```{r}
library(riskclustr)

mod2 <- eh_test_marker(
  markers = list("marker1", "marker2"),
  factors = list("x1", "x2", "x3"),
  case = "case",
  data = subtype_data)
```

See the function [documentation](https://www.emilyzabor.com/riskclustr/reference/eh_test_marker.html) for details of function arguments.  

The resulting estimates $\hat{\boldsymbol{\gamma}}$ can be accessed with

```{r eval = FALSE}
mod2$gamma
```


```{r echo = FALSE}
knitr::kable(mod2$gamma)
```

the associated standard deviation estimates $\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}$ with

```{r eval = FALSE}
mod2$gamma_se
```


```{r echo = FALSE}
knitr::kable(mod2$gamma_se)
```

and the associated p-values with 

```{r eval = FALSE}
mod2$gamma_pval
```


```{r echo = FALSE}
knitr::kable(mod2$gamma_pval)
```

An overall formatted dataframe containing the $\hat{\boldsymbol{\gamma}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\Big)$ and p-values to test the null hypotheses $H_{0_{\gamma_{pk}}}$ can be obtained as

```{r eval = FALSE}
mod2$gamma_se_p
```


```{r echo = FALSE}
knitr::kable(mod2$gamma_se_p)
```

The estimates and heterogeneity p-values for disease subtypes formed by cross-classifying these individual disease markers can also be accessed in objects `beta_se_p` and `or_ci_p`, as described in the section on [Pre-specified subtypes](#subtype).