---
title: "A set-based association test in snpsettest"
author: "Jaehyun Joo"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{A set-based association test in snpsettest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

For set-based association tests, the **snpsettest** package employed the
statistical model described in VEGAS (**ve**rsatile **g**ene-based
**a**ssociation **s**tudy) [1], which takes as input variant-level p values and
reference linkage disequilibrium (LD) data. Briefly, the test statistics is
defined as the sum of squared variant-level Z-statistics. Letting a set of $Z$
scores of individual SNPs $z_i$ for $i \in 1:p$ within a set $s$, the test
statistic $Q_s$ is

$$Q_s = \sum_{i=1}^p z_i^2$$

Here, $Z = \{z_1,...,z_p\}'$ is a vector of multivariate normal distribution
with a mean vector $\mu$ and a covariance matrix $\Sigma$ in which $\Sigma$
represents LD among SNPs. To test a set-level association, we need to evaluate
the distribution of $Q_s$. VEGAS uses Monte Carlo simulations to approximate the
distribution of $Q_s$ (directly simulate $Z$ from multivariate normal
distribution), and thus, compute a set-level p value. However, its use is
hampered in practice when set-based p values are very small because the number
of simulations required to obtain such p values is be very large. The
**snpsettest** package utilizes a different approach to evaluate the
distribution of $Q_s$ more efficiently.

Let $Y
= \Sigma^{-\frac12}Z$ (instead of $\Sigma^{-\frac12}$, we could use any
decomposition that satisfies $\Sigma = AA'$ with a $p \times p$ non-singular
matrix $A$ such that $Y = A^{-1}Z$). Then,

$$
\begin{gathered}
E(Y) = \Sigma^{-\frac12} \mu \\
Var(Y) = \Sigma^{-\frac12}\Sigma\Sigma^{-\frac12} = I_p \\
Y \sim N(\Sigma^{-\frac12} \mu,~I_p)
\end{gathered}
$$

Now, we posit $U = \Sigma^{-\frac12}(Z - \mu)$ so that

$$U \sim N(\mathbf{0}, I_p),~~U = Y - \Sigma^{-\frac12}\mu$$

and express the test statistic $Q_s$ as a quadratic form:

$$
\begin{aligned}
  Q_s &= \sum_{i=1}^p z_i^2 = Z'I_pZ = Y'\Sigma^{\frac12}I_p\Sigma^{\frac12}Y \\
      &= (U + \Sigma^{-\frac12}\mu)'\Sigma(U + \Sigma^{-\frac12}\mu)
\end{aligned}
$$

With the spectral theorem, $\Sigma$ can be decomposed as follow:

$$
\begin{gathered}
\Sigma = P\Lambda P' \\ \Lambda = \mathbf{diag}(\lambda_1,...,\lambda_p),~~P'P
= PP' = I_p
\end{gathered}
$$

where $P$ is an orthogonal matrix. If we set $X = P'U$, $X$ is a vector of
independent standard normal variable $X \sim N(\mathbf{0}, I_p)$ since

$$E(X) = P'E(U) = \mathbf{0},~~Var(X) = P'Var(U)P = P'I_pP = I_p$$

$$
\begin{aligned}
  Q_s &= (U + \Sigma^{-\frac12}\mu)'\Sigma(U + \Sigma^{-\frac12}\mu) \\
  &= (U + \Sigma^{-\frac12}\mu)'P\Lambda P'(U + \Sigma^{-\frac12}\mu) \\
  &= (X + P'\Sigma^{-\frac12}\mu)'\Lambda (X + P'\Sigma^{-\frac12}\mu)
\end{aligned}
$$

Under the null hypothesis, $\mu$ is assumed to be $\mathbf{0}$. Hence,

$$Q_s = X'\Lambda X = \sum_{i=1}^p \lambda_i x_i^2$$

where $X = \{x_1,...,x_p\}'$. Thus, the null distribution of $Q_s$ is a linear
combination of independent chi-square variables $x_i^2 \sim \chi_{(1)}^2$ (i.e.,
central quadratic form in independent normal variables). For computing a
probability with a scalar $q$,

$$Pr(Q_s > q)$$

several methods have been proposed, such as numerical inversion of the
characteristic function [2]. The **snpsettest** package uses the algorithm of
Davies [3] or saddlepoint approximation [4] to obtain set-based p values.


**References**

1. Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A
Versatile Gene-Based Test for Genome-wide Association Studies. Am J Hum Genet.
2010 Jul 9;87(1):139–45.

2. Duchesne P, De Micheaux P. Computing the distribution of quadratic forms:
Further comparisons between the Liu-Tang-Zhang approximation and exact methods.
Comput Stat Data Anal. 2010;54:858–62.

3. Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of
Chi-square Random Variables. J R Stat Soc Ser C Appl Stat. 1980;29(3):323–33.

4. Kuonen D. Saddlepoint Approximations for Distributions of Quadratic Forms in
Normal Variables. Biometrika. 1999;86(4):929–35.
