---
title: "tmod: Analysis of Gene Set Enrichments"
author: "January Weiner"
date: "`r Sys.Date()`"
output: 
  html_vignette:
    self_contained: true
    css: "style.css"
link-citations: true
colorlinks: true
monofontoptions: Scale=0.8
documentclass: report
fontsize: 12pt
numbersections: true
header-includes:
   - \usepackage{extsizes}
vignette: >
  %\VignetteIndexEntry{tmod: Analysis of Gene Set Enrichments}
  %\VignetteKeyword{tmod}
  %\VignetteKeyword{gene set enrichment analysis}
  %\VignetteKeyword{GSE}
  %\VignetteKeyword{transcription}
  %\VignetteKeyword{gene expression}
  %\VignetteEngine{knitr::rmarkdown}
  %\SweaveUTF8
  \usepackage[utf8](inputenc)
abstract: |
  The package includes several tools for testing
  the significance of enrichment of modules or other gene sets as well as
  visualisation of the features (genes, metabolites etc.) and gene sets,
  feature sets or
  transcriptional modules.
  Furthermore, the package provides blood transcriptional modules described
  by Chaussabel et al. [-@chaussabel2008modular] and by Li et al.
  [-@li2014molecular] as well as metabolic profiling clusters from Weiner
  et al. [-@weiner2012biomarkers].  This
  user guide is a tutorial and main documentation for the package.
bibliography: bibliography.bib
---

```{r setup, echo=FALSE, results="hide", include=FALSE}
library(knitr)
opts_chunk$set(cache=TRUE, autodep=TRUE, tidy=FALSE, fig.width=5, warning=FALSE, fig.height=5, width=60)
opts_knit$set(width=60)
is.internet <- FALSE
```


```{r eval=TRUE,echo=FALSE}
library(ggplot2)
theme_set(theme_minimal())
```

<img src="logo.png" style="position:absolute;top:25px;right:5px;" />

Before you start
================

A more comprehensive user manual for tmod is found on
[github](https://january3.github.io/tmod/). This document is a shortened
version of the manual, adapted to ship as a vignette with the package.
Therefore, if you are not offline, I strongly suggest to use the online
version.

Quick start guide
=================

First, obtain an ordered list of genes. For example, list of genes ordered
by q-values (FDR):


```{r qsg1}
library(tmod)
data(EgambiaResults)
tt <- EgambiaResults
tt <- tt[ order(tt$adj.P.Val), ]
l  <- tt$GENE_SYMBOL
```

Run the CERNO test:


```{r qsg2}
res <- tmodCERNOtest(l)
head(res)
```

Visualize the results:


```{r qsg3}
ggPanelplot(res)
```

Inspect individual gene sets using evidence plots:


```{r qsg4,eval=FALSE}
ggEvidencePlot(l, "LI.M37.0", gene.labels=FALSE)
ggEvidencePlot(l, "DC.M4.2")
ggEvidencePlot(l, "DC.M3.4")
ggEvidencePlot(l, "DC.M1.2")
```


```{r fig.width=10,fig.height=10,echo=FALSE}
library(cowplot)
g1 <- ggEvidencePlot(l, "LI.M37.0", gene.labels=FALSE) +
  ggtitle("LI.M37.0")
g2 <- ggEvidencePlot(l, "DC.M4.2") +
  ggtitle("DC.M4.2")
g3 <- ggEvidencePlot(l, "DC.M3.4") +
  ggtitle("DC.M3.4")
g4 <- ggEvidencePlot(l, "DC.M1.2") +
  ggtitle("DC.M1.2")
plot_grid(g1, g2, g3, g4)
```

Find out which genes are in DC.M1.2:


```{r}
getModuleMembers("DC.M1.2")
```















Introduction
============

Gene set enrichment (GSE) analysis is an increasingly important tool in
the biological interpretation of high throughput data, versatile and
powerful. In general, there are three generations of GSE algorithms and
packages. 

First generation approaches test for enrichment in defined sets of
differentially expressed genes (often called "foreground") against the set
of all genes ("background"). The statistical test involved is usually a
hypergeometric or Fisher's exact test. The main problem with this kind of
approach is that it relies on arbitrary thresholds (like p-value or log fold
change cut-offs), and the number of genes that go into the "foreground" set
depends on the statistical power involved. Comparison between the same
experimental condition will thus yield vastly different results depending
on the number of samples used in the experiment.

The second generation of GSE involve tests which do not rely on such
arbitrary definitions of what is differentially expressed, and what not,
and instead directly or indirectly employ the information about the
statistical distribution of individual genes. A popular implementation of
this type of GSE analysis is the eponymous GSEA program [@subramanian2005gene].
While popular and quite powerful for a range of applications, this software
has important limitations due to its reliance on bootstrapping to obtain an
exact p-value. For one thing, the performance of GSEA dramatically
decreases for small sample numbers [@weiner2016tmod]. Moreover, the specifics of the
approach prevent it from being used in applications where a direct test for
differential expression is either not present (for example, in multivariate
functional analysis, see Section 
["Functional multivariate analysis"](#functional-multivariate-analysis)).

The `tmod` package [@zyla2019gene] and the included CERNO^[Coincident Extreme Ranks in
Numerical Observations [@yamaguchi2008ifn]] test belong to the second
generation of algorithms. However, unlike the program GSEA, the CERNO
relies exclusively on an ordered list of genes, and the test statistic has
a χ² distribution. Thus, it is suitable for any application in which an
ordered list of genes is generated: for example, it is possible to apply
tmod to weights of PCA components or to variable importance measure of a
machine learning model.

`tmod` was created with the following properties in mind: (i) test for
enrichment which relies on a list of sorted genes, (ii) with an analytical
solution, (iii) flexible, allowing custom gene sets and analyses, (iv) with
visualizations of multiple analysis results, suitable for time series and
suchlike, (v) including transcriptional module definitions not present in
other databases and, finally, (vi) to be suitable for use in R.




Dive into tmod: analysis of transcriptomic responses to tuberculosis 
====================================================================

Introduction
------------

In this chapter, I will use an example data set included in tmod to show
the application of tmod to the analysis of differential gene expression.
The data set has been generated by Maertzdorf et al.
[-@maertzdorf2011functional] and has the GEO ID GSE28623.  Is based on
whole blood RNA microarrays from tuberculosis (TB) patients and healthy
controls.

Although microarrays were used to generate the data, the principle is the
same as in RNASeq.


The Gambia data set
-------------------

In the following, we will use the Egambia data set included in the package.
The data is already background corrected and normalized, so we can proceed
with a differential gene expression analysis. Note that only a bit over
5000 genes from the original set of over 45000 probes is included. 

```{r twoA,eval=TRUE}
library(tmod)
data(Egambia)
E <- as.matrix(Egambia[,-c(1:3)])
```

The
results for this data set have been obtained using the limma package as
follows:


```{r limma,eval=FALSE}
library(limma)
design <- cbind(Intercept=rep(1, 30), TB=rep(c(0,1), each= 15))
fit <- eBayes(lmFit(E, design))
tt <- topTable(fit, coef=2, number=Inf, 
  genelist=Egambia[,1:3])
```

However, they are also included in the tmod package:

```{r twoA.2}
data(EgambiaResults)
tt <- EgambiaResults
```

The table below shows first couple of results from the table `tt`.

```{r twoA2,echo=FALSE, results="asis"}
library(pander)
options(digits=3)
tmp <- tt[,c("GENE_SYMBOL", "GENE_NAME", "logFC", "adj.P.Val")]
rownames(tmp) <- NULL
pandoc.table(head(tmp), split.tables=Inf, justify="llrr")
```

OK, we see some of the genes known to be prominent in the human host
response to TB. We can display one of these using tmod's `showGene` 
function (it's just a `boxplot` combined with a `beeswarm`, nothing
special):

```{r basic2, fig.width=4, fig.height=4}
group <- rep( c("CTRL", "TB"), each=15)
showGene(E["20799",], group,
  main=Egambia["20799", "GENE_SYMBOL"])
```

Fine, but what about the gene sets?


Transcriptional module analysis with GSE
-----------------------------------------

There are two main functions in tmod to understand which modules or gene
sets are significantly enriched. There are several statistical tests which can be used from within tmod (see
chapter ["Statistical tests in tmod"](#statistical-tests-in-tmod) below),
but here we will use the CERNO test, which is the main reason this package
exist. CERNO is particularly fast and robust second generation approach,
recommended for most applications. 

CERNO works with an ordered list of genes (only ranks matter, no other
statistic is necessary); the idea is to test, for each gene set, whether
the genes in this gene set are more likely than others to be at the
beginning of that list.  The CERNO statistic has a $\chi^2$ distribution
and therefore no randomization is necessary, making the test really fast.

```{r fourB}
l    <- tt$GENE_SYMBOL
resC <- tmodCERNOtest(l)
head(resC, 15)
```

Only first 15 results are shown above.

Columns in the above table contain the following:

 * **ID** The module ID. IDs starting with "LI" come from Li et al.
[@li2014molecular], while IDs starting with "DC" have been defined by
Chaussabel et al. [@chaussabel2008modular].
 * **Title** The module description
 * **cerno** The CERNO statistic
 * **N1** Number of genes in the module
 * **AUC** The area under curve – main size estimate
 * **cES** CERNO statistic divided by $2\times N1$
 * **P.Value** P-value from the hypergeometric test
 * **adj.P.Val** P-value adjusted for multiple testing using the
Benjamini-Hochberg correction

These results make a lot of sense: the transcriptional modules found to be
enriched in a comparison of TB patients with healthy individuals are in
line with the published findings. In especially, we see the interferon
response, complement system as well as T-cell related modules.

Visualizing results
---------------------

The main working horse for visualizing the results in tmod is the function
`ggPanelplot`. This is really a glorified heatmap which shows both the
effect size (size of the blob on the figure below) and the p-value
(intensity of the color). Each column corresponds to a different
comparison. Here, there will be only one column for the only comparison we
made, however we need to wrap it in a `list` object. However, we can also
use a slightly different representation to also show how many significantly
up- and down-regulated^[Formally, we don't test for regulation here.
"Differentially expressed" is the correct expression. I will use, however,
the word "regulated" throughout this manual with the understanding that it
only means "there is a difference between two conditions" because it is shorter.] 
genes are found in the enriched modules (right panel on the figure below).

Note: below I use the `plot_grid` function from the cowplot package to put
the figures side by side.

```{r panelplot00,fig.width=14,fig.height=7}
library(cowplot)

g1 <- ggPanelplot(list(Gambia=resC))

## calculate the number of significant genes
## per module
sgenes <- tmodDecideTests(g=tt$GENE_SYMBOL,
  lfc=tt$logFC,
  pval=tt$adj.P.Val)
names(sgenes) <- "Gambia"
g2 <- ggPanelplot(list(Gambia=resC), sgenes = sgenes)
plot_grid(g1, g2, labels=c("A", "B"))
```

On the right hand side, the red color on the bars indicates that all
signficantly regulated in the enriched modules. The size of the bar
corresponds to the AUC, and intensity of the color corresponds to the
p-value.  See chapter "Visualisation and presentation of results in tmod"
for more information on this and other functions.


Statistical tests in tmod
=========================

Introduction
------------

There is a substantial numer of different gene set enrichment tests.
Several are implemented in `tmod` (see Table below for a summary). This
chapter gives an overview of the possibilities for gene set enrichment
analysis with `tmod`.

-----------------------------------------------------------------------------------------------
Test              Description            Input type
----------------- ---------------------- ---------------------------
tmodHGtest        First generation test  Two sets, foreground
                                         and background

tmodUtest         Wilcoxon U test        Ordered gene list

tmodCERNOtest     CERNO test             Ordered gene list

tmodZtest         variant of             Ordered gene list
                  the CERNO test

tmodPLAGEtest     eigengene-based        Expression matrix

tmodAUC           general permutation    Matrix of ranks   
                  based testing

tmodGeneSetTest   permutation based      A statistic
                  on a particular        (e.g. logFC)
                  statistic
-----------------------------------------------------------------------------------------------



In the following, I will briefly describe the various tests and show
examples of usage on the Gambia data set.





First generation tests
----------------------

First generation tests were based on an overrepresentation analysis (ORA).
In essence, they rely on splitting the genes into two groups: the genes of
interest (foreground), such as genes that we consider to be significantly
regulated in an experimental condition, and all the rest (background).
For a given gene set, this results in a $2\times 2$ contingency table. If
these two factors are independent (i.e., the probability of a gene
belonging to a gene set is independent of the probability of a gene being
regulated in the experimental condition), then we can easily derive
expected frequencies for each cell of the table. Several statistical tests
exist to test whether the expected frequencies differ significantly from
the observed frequencies.

In tmod, the function `tmodHGtest()`, performs a hypergeometric test on two
groups of genes: 'foreground' (fg), or the list of differentially expressed
genes, and 'background' (bg) -- the gene universe, i.e., all genes present
in the analysis. The gene identifiers used currently by tmod are HGNC
identifiers, and we will use the GENE_SYMBOL field from the Egambia data
set.

In this particular example, however, we have almost no genes which are
significantly differentially expressed after correction for multiple
testing: the power of the test with 10 individuals in each group is too low.
For the sake of the example, we will therefore relax our selection.
Normally, I'd use a q-value threshold of at least 0.001.

```{r mod1}
fg <- tt$GENE_SYMBOL[tt$adj.P.Val < 0.05 & abs( tt$logFC ) > 1]
resHG <- tmodHGtest(fg=fg, bg=tt$GENE_SYMBOL)
options(width=60)
resHG
```

The columns in the above table contain the following:

 * **ID** The module ID. IDs starting with "LI" come from Li et al.
[@li2014molecular], while IDs starting with "DC" have been defined by
Chaussabel et al. [@chaussabel2008modular].
 * **Title** The module description
 * **b** Number of genes from the given module in the fg set 
 * **B** Number of genes from the module in the bg set
 * **n** Size of the fg set
 * **N** Size of the bg set
 * **E** Enrichment, calcualted as (b/n)/(B/N)
 * **P.Value** P-value from the hypergeometric test
 * **adj.P.Val** P-value adjusted for multiple testing using the
Benjamini-Hochberg correction

Well, IFN signature in TB is well known. However, the numbers of genes are
not high: n is the size of the foreground, and b the number of genes in fg
that belong to the given module. N and B are the respective totals -- size
of bg+fg and number of genes that belong to the module that are found in this
totality of the analysed genes. If we were using the full Gambia data set
(with all its genes), we would have a different situation.

Lack of significant genes is the main problem of ORA: splitting the genes
into foreground and background relies on an arbitrary threshold which will
yield very different results for different sample sizes. 


Second generation tests
-----------------------

### U-test (`tmodUtest`)

Another approach is to sort all the genes (for example, by the respective
p-value) and perform a U-test on the ranks of (i) genes belonging to the
module and (ii) genes that do not belong to the module. This is a bit
slower, but often works even in the case if the power of the statistical
test for differential expression is low. That is, even if only a few genes
or none at all are significant at acceptable thresholds, sorting them by
    the p-value or another similar metric can nonetheless allow to get
meaningful enrichments^[The rationale is that the non-significant
p-values are not associated with the test that we are actually performing,
but merely used to sort the gene list. Thus, it does not matter whether
they are significant or not.].

Moreover, we do not need to set arbitrary
thresholds, like p-value or logFC cutoff.

The main issue with the U-test is that it detects enrichments as well as
depletions -- that is, modules which are enriched at the bottom of the list
(e.g.  modules which are never, ever regulated in a particular comparison)
will be detected as well. This is often undesirable. Secondly, large
modules will be reported as significant even if the actual effect size
(i.e., AUC) is modest or very small, just because of the sheer number of
genes in a module. Unfortunately, also the reverse is true: modules with a
small number of genes, even if they consist of highly up- or down-regulated
genes from the top of the list will not be detected.

```{r mod2}
l    <- tt$GENE_SYMBOL
resU <- tmodUtest(l)
head(resU)
nrow(resU)
```

This list makes a lot of sense, and also is more stable than the other one:
it does not depend on modules that contain just a few genes. Since the
statistics is different, the b, B, n, N and E columns in the output have
been replaced by the following:

 * **U** The Mann-Whitney U statistics
 * **N1** Number of genes in the module
 * **AUC** Area under curve -- a measure of the effect size

A U-test has been also implemented in limma in the `wilcoxGST()`
function.

### CERNO test (`tmodCERNOtest` and `tmodZtest`)

There are two tests in tmod which both operate on an ordered list of genes:
`tmodUtest` and `tmodCERNOtest`. The U test is simple, however
has two main issues. Firstly, 
The CERNO test, described by Yamaguchi et al. [-@yamaguchi2008ifn], is
based on Fisher's method of combining probabilities. In summary, for a
given module, the scaled ranks of genes from the module are treated as
probabilities. These are then logarithmized, summed and multiplied by -2:

$$f_{CERNO}=-2 \cdot \sum_{i = 1}^{N} \ln{\frac{R_i}{N_{tot}}}$$

This statitic has the $\chi^2$ distribution with $2\cdot N$ degrees of
freedom, where $N$ is the number of genes in a given module and $N_{tot}$
is the total number of genes [@yamaguchi2008ifn].

The CERNO test is actually much more practical than other tests for most
purposes and is the recommended approach. A variant called `tmodZtest`
exists in which the p-values are combined using Stouffer's method rather
than the Fisher's method.

```{r mod2cerno}
l    <- tt$GENE_SYMBOL
resCERNO <- tmodCERNOtest(l)
head(resCERNO)
nrow(resCERNO)
```


### PLAGE

PLAGE [@tomfohr2005pathway] is a gene set enrichment method based on
singular value decomposition (SVD). The idea is that instead of running a
statistical test (such as a t-test) on each gene separately, information
present in the gene expression of all genes in a gene set is first extracted
using SVD, and the resulting vector (one per gene set) is treated as a
"pseudo gene" and analysed using the approppriate statistical tool. 

In the tmod implementation, for each module a gene expression matrix subset
is generated and decomposed using PCA using the `eigengene()` function. The
first component is returned and a t-test comparing two groups is then
performed. This limits the implementation to only two groups, but extending
it for more than one group is trivial.

```{r plage}
tmodPLAGEtest(Egambia$GENE_SYMBOL, Egambia[,-c(1:3)], group=group)
```



Visualisation and presentation of results in tmod
=================================================

Introduction
------------

By default, results produced by tmod are data frames containing one row per
tested gene set / module. In certain circumstances, when multiple tests are
performed, the returned object is a list in which each element is a results
table. In other situations a list can be created manually. In any case, it
is often necessary to extract, compare or summarize one or more result
tables.


Visualizing gene sets with eigengene
------------------------------------

One of the most frequent demands is to somehow show differences in
transcription in the gene sets between groups. This is usually quite
bothersome. One possibility is to use a heatmap, another to show how all
genes vary between the groups. Both these options result in messy,
frequently biased pictures (for example, if only the "best" genes are
selected for a heatmap).

The code below shows how to select genes from a module with the
`getModuleMembers()` function. For this, we will use the Gambia data set
and the LI.M75 interferon module.

```{r heatmap0}
m <- "LI.M75"
## getModuleMembers returns a list – you can choose to 
## select multiple modules
genes <- getModuleMembers(m)[[1]]
sel <- Egambia$GENE_SYMBOL %in% genes
x <- data.matrix(Egambia)[sel, -c(1:3)] # expression matrix
```

Matix x contains expression of all genes from the selected module (one row
per gene, one column per sample).

Then, heatmaps and line plots can be generated. Note that both of the
approaches below are discouraged. Firstly, these representations are
chaotic and hard to read. Secondly, it is easy to manipulate such plots in
order to make the effects more prominent than they really are. Thirdly,
they use up a lot of space and ink to convey very little useful and
interpretable information, which is a sin[@tufte2014visual].

```{r heatmap1,fig.width=12,fig.height=7,echo=FALSE}
x0 <- t(scale(t(x)))

cols <- colorRampPalette(c("blue", "white", "red"))(17) 
library(gplots)
heatmap.2(x0, trace="n", 
  labRow=Egambia$GENE_SYMBOL[sel], 
  scale="n", 
  dendrogram="r", Colv=F, 
  col=cols,
  breaks=seq(-2.5, 2.5, length.out=18))
```

```{r heatmap2,fig.width=10,fig.height=7,echo=FALSE}
## per group means and standard deviations
ms <- apply(x, 1, function(xx) tapply(xx, group, mean))
sd <- apply(x, 1, function(xx) tapply(xx, group, sd))

library(plotwidgets)
plot(NULL, bty="n", ylim=range(x), xlim=c(0.5, 2.5), 
  xaxt="n", xlab="group", ylab="Expression")
axis(1, at=c(1,2), labels=unique(group))
errbar <- function(x, y0, y1, w=0.1, ...) {
  w <- w/2
  segments(x, y0, x, y1, ...)
  segments(x-w, y0, x+w, y0, ...)
  segments(x-w, y1, x+w, y1, ...)
}

cols <- plotPals("default")
n <- ncol(ms)
segments(1, ms[1,], 2, ms[2,], col=cols, lwd=2)
points(rep(1, n), ms[1,], pch=19, col=cols)
points(rep(2, n), ms[2,], pch=19, col=cols)
errbar(1, ms[1,]-sd[1,], ms[1,]+sd[1,], col=cols)
errbar(2, ms[2,]-sd[2,], ms[2,]+sd[2,], col=cols)
```

A better (but not perfect) approach is to use *eigengenes*. An eigengene is
a vector of numbers which are thought to represent all genes in a gene set.
It is calculated by running a principal component analysis on an expression
matrix of the genes of interest. This vector can be thought of as an
"average" or "representative" gene of a gene set.

Below, we calculate all eigengenes from modules in tmod and display two of
them. The object `eig` will contain one row per module and one column per
sample.

```{r eigengene,fig.width=8,fig.height=5}
par(mfrow=c(1,2))
eig <- eigengene(Egambia[,-c(1:3)], Egambia$GENE_SYMBOL)
showGene(eig["LI.M75", ], group, 
  ylab="Eigengene",
  main="antiviral Interferon signature")
showGene(eig["LI.M16", ], group, 
  ylab="Eigengene",
  main="TLR and inflammatory signaling")
```

In fact, one can compare the eigengenes using a t.test or another
statistical procedure – this is the essence of the PLAGE algorithm,
described earlier.


Showing enrichment with evidence plots
--------------------------------------

Let us first investigate in more detail the module LI.M75, the antiviral
interferon signature. We can use the `ggEvidencePlot` function to see
how the module is enriched in the list `l`.

```{r five,fig=TRUE,fig.width=7,fig.height=4}
l    <- tt$GENE_SYMBOL
theme_set(theme_minimal())
ggEvidencePlot(l, "LI.M75") 
```

In essence, this is a receiver-operator characteristic (ROC) curve, and the
area under the curve (AUC) is related to the U-statistic, from which the P-value
in the tmodUtest is calculated, as 
$\text{AUC}=\frac{U}{n_1\cdot n_2}$.
Both the U statistic and the AUC are reported. Moreover, the AUC can be
used to calculate effect size according to the Wendt's formula[@wendt1972dealing] for
rank-biserial correlation coefficient:

$$r=1-\frac{2\cdot U}{n_1\cdot n_2} = 1 - 2\cdot\text{AUC}$$

In the above diagram, we see that nine out of the 10 genes that belong to
the LI.M75 module and which are present in the Egambia data set are ranked
among the top 1000 genes (as sorted by p-value).

Evidence plots are an important tool to understand whether the obtained
results are not only significant in the statistical sense, but whether they
are also of biological interest. As mentioned above, the area under the
evidence plot, the AUC (area under curve) is an important measure of the
effect size. Effect size should always be taken into consideration when
interpreting p-values.

The figure below shows three very different enrichments.


```{r fig.width=12}
library(purrr)
sel <- c("LI.M67", "LI.M37.0")
plots <- map(sel, ~ ggEvidencePlot(l, .x, gene.labels=FALSE))
plot_grid(plotlist = plots, labels=sel)
```

The corresponding results are shown in the following table:


```{r}
foo <- tmodCERNOtest(l) %>% dplyr::filter(ID %in% sel) 
foo %>% knitr::kable()
```

`r sel[1]`
(`r foo[sel[1], "Title"]`)
 contains only 
`r foo[sel[1], "N1"]`
genes, but as most 
of them have really low p-values and are on the top of the p-value
sorted list, the resulting effect size is close to 1.

The next panel shows a gene set containing a large number of genes. The
effect size for 
`r sel[2]`
is much smaller
(`r format(foo[sel[2], "AUC"], digits=2)`), 
but thanks to
the large number of genes the enrichment is significant,
with p-value lower than for `r sel[1]`.

Very often, a gene set with a larger number of genes will have a low
p-value. However, these gene sets are frequently not very specific, and the
corresponding effect sizes are not large. In the above example, the
`r sel[1]` is a much more interesting result: it is more specific and shows
a much larger effect, despite having a much higher p-value.






Summary tables
--------------

We can summarize the output from the previously run tests (`tmodUtest`,
`tmodCERNOtest` and `tmodHGtest`) in one table using `tmodSummary`. For
this, we will create a list containing results from all comparisons.

```{r fourC,size="tiny"}
resAll <- list(CERNO=resC, U=resU, HG=resHG)
#head(tmodSummary(resAll))
```

The table below shows the results.

```{r fourC2,results="asis",echo=FALSE,size="tiny"}
tmp <- tmodSummary(resAll)
rownames(tmp) <- NULL
pandoc.table(head(tmp), split.tables=Inf, justify="llrrrrrr")
```


Panel plots
--------------------------------

A list of result tables (or even of a single result table) can be
visualized using a heatmap-like plot called here "panel plot". The idea is
to show both, effect sizes and p-values, and, optionally, also the
direction of gene regulation.

In the example below, we will use the `resAll`  object created above,
containing the results from three different tests for enrichment, to
compare the results of the individual tests. However, since the `E` column
of HG test is not easily comparable to the AUC values (which are between 0
and 1), we need to scale it down. Also, we need to call it "AUC", because
otherwise we can't show the values on the same plot.

```{r panelplots, fig.width=8, fig.height=6}
resAll$HG$AUC <- log10(resAll$HG$E) - 0.5
ggPanelplot(resAll)
```

Each enrichment result corresponds to a reddish bar. The size of the bar
corresponds to the effect size (AUC or log10(Enrichment) - 0.5, as it may be),
and color intensity corresponds to the p-value -- pale colors show p-values
closer to 0.01. It is easily seen how `tmodCERNOtest` is the more
sensitive option.

We can see that also the intercept term is enriched for genes
found in monocytes and neutrophils. Note that by default, ggPanelplot only
shows enrichments with p < 0.01, hence a slight difference from the tmodSummary
output. This behavior can be modified by the `q_thr` option:

```{r panelplots2, fig.width=8, fig.height=5}
ggPanelplot(resAll, q_thr=1e-3)
```

However, one is usually interested in the direction of regulation. If a
gene list is sorted by p-value, the enriched modules may contain both up-
or down-regulated genes^[Searching for enrichment only in up- or only in
down-regulated genes depends on how the gene list is sorted; this is
described in Section "[Testing for up- or down-regulated genes
separately](#updown)".]. It is often desirable to visualize whether genes
in a module go up, or go down between experimental conditions. For this,
the function `tmodDecideTests` is used to obtain the number of
significantly up- or down-regulated genes in a module.

This information must be obtained separately from the differential gene
expression analysis and provided as a list to `ggPanelplot`. The names of
this list must be identical to the names in the results list. Below, we
reuse the same object for all three tests, since the DEGs (differentially
expressed genes) were the same in all three comparisons.

```{r pie, fig.width=10,fig.height=6}
degs <- tmodDecideTests(g=tt$GENE_SYMBOL, lfc=tt$logFC,
                       pval=tt$adj.P.Val)[[1]]
degs <- list(CERNO=degs, HG=degs, U=degs)
ggPanelplot(resAll, sgenes = degs)
```

Each mini-plot shows the effect size of the enrichment and the
corresponding p-value, as before. Additionally, the fraction of
up-regulated and down-regulated genes is visualized by coloring a fraction
of the area of the mini-plot red or blue, respectively^[The colors can be
modified by the parameter `colors`].

The `ggPanelplot` function has several parameters, notably for filtering
and labelling:

 * Filtering:
     * `filter_row_q` and `filter_row_auc` remove gene sets for which,
       respectively, the FDR p-value or the effect size are not achieved in
       any of the comparisons;
     * `q_cutoff`: enrichments with q-values above this cutoff are
       considered to be absent

The `ggPanelplot` function returns a ggplot2 graph, and therefore allows
much customization.







Using tmod for other types of GSE analyses
==============================================================

The fact that tmod relies on a single ordered list of genes makes it useful
in many other situations in which such a list presents itself. 

Correlation analysis
--------------------

Genes can be ordered by their absolute correlation with a variable or even
a data set or a module. For example, one can ask the question about a
function of a particular unknown gene -- such as ANKRD22, annotated as
"ankyrin repeat domain 22". 

```{r ankrd22}
x <- E[ match("ANKRD22", Egambia$GENE_SYMBOL), ]
cors <- t(cor(x, t(E)))
ord <- order(abs(cors), decreasing=TRUE)
head(tmodCERNOtest(Egambia$GENE_SYMBOL[ ord ]))
```

Clearly, ANKRD22 correlates to other immune related genes, most of all
these which are interferon inducible.

In another example, consider correlation between genes and the first
principal component ("eigengene") of a group of genes of unknown
function^[More on eigengenes in the Chapter on modules]. To demonstrate the
method, we will select the genes from the module "LI.M75". For this, we use
the function `getGenes` with the optional argument `genes` used to
filter the genes in the module by the genes present in the data set.

```{r m75}
g <- getGenes("LI.M75", genes=Egambia$GENE_SYMBOL, 
              as.list=TRUE)
x <- E[ match(g[[1]], Egambia$GENE_SYMBOL), ]

## calculating the "eigengene" (PC1)
pca <- prcomp(t(x), scale.=T)
eigen <- pca$x[,1]
cors <- t(cor(eigen, t(E)))

## order all genes by the correlation between the gene and the PC1
ord <- order(abs(cors), decreasing=TRUE)
head(tmodCERNOtest(Egambia$GENE_SYMBOL[ ord ]))
```




Functional multivariate analysis
--------------------------------

Transcriptional modules can help to understand the biological meaning of
the calculated multivariate transformations. For example, consider a
principal component analysis (PCA), visualised using the pca3d package
[@pca3d]:


```{r six,fig=TRUE,fig.width=8,fig.height=5}
mypal <- c("#E69F00", "#56B4E9")
pca <- prcomp(t(E), scale.=TRUE)

col <- mypal[ factor(group) ]
par(mfrow=c(1, 2))
l<-pcaplot(pca, group=group, col=col)
 
legend("topleft", as.character(l$groups),
       pch=l$pch,
       col=l$colors, bty="n")
l<-pcaplot(pca, group=group, col=col, components=3:4)
legend("topleft", as.character(l$groups),
       pch=l$pch,
       col=l$colors, bty="n")
```


The fourth component looks really interesting. Does it correspond to the
modules which we have found before? Each principal component is, after all,
a linear combination of gene expression values multiplied by weights (or
scores) which are 
constant for a given component. The *i*-th principal component for
sample *j* is given by

$$PC_{i,j} = \sum_{k} w_{i,k} \cdot x_{k,j}$$

where $k$ is the index of the variables (genes in our case), $w_{i,k}$ is
the weight associated with the $i$-th component and the $k$-th variable
(gene), and $x_{k,j}$ is the value of the variable $k$ for the sample $j$;
that is, the gene expression of gene $k$ in the sample $j$. Genes influence
the position of a sample along a given component the more the larger their
absolute weight for that component.

For example, on the right-hand figure above, we see that samples which
were taken from TB patients have a high value of the principal component 4;
the opposite is true for the healthy controls.  The genes that allow us to
differentiate between these two groups will have very large, positive weights for genes
highly expressed in TB patients, and very large, negative weights for genes
which are highly expressed in NID, but not TB.

We can sort the genes by their weight in
the given component, since the weights are stored in the pca object in the "rotation"
slot, and use the tmodUtest function to test for enrichment of the modules.


```{r seven}
o <- order(abs(pca$rotation[,4]), decreasing=TRUE)
l <- Egambia$GENE_SYMBOL[o]
res <- tmodUtest(l)
head(res)
```


Perfect, this is what we expected: we see that the neutrophil / interferon
signature which is the hallmark of the TB biosignature. What about other
components? We can run the enrichment for each component and visualise the
results using tmod's functions tmodSummary and ggPanelplot. Below, we use
the filter.empty option to omit the principal components which show no
enrichment at all.


```{r pcasum}
# Calculate enrichment for each component
gs   <- Egambia$GENE_SYMBOL
# function calculating the enrichment of a PC
gn.f <- function(r) {
    tmodCERNOtest(gs[order(abs(r), decreasing=T)],
                qval=0.01)
}
x <- apply(pca$rotation, 2, gn.f)
tmodSummary(x, filter.empty=TRUE)[1:5,]
```


The following plot shows the same information in a visual form. The size of
the blobs corresponds to the effect size (AUC value), and their color -- to
the q-value.


```{r pcsum2,fig=TRUE,fig.width=8,fig.height=5,results="hide"}
ggPanelplot(x)
```


However, we might want to ask, for each module, how many of the genes in that
module have a negative, and how many have a positive weight?  We can use the
function tmodDecideTests for that. For each principal component shown, we want
to know how many genes have very large (in absolute terms) weights -- we can use
the "lfc" parameter of tmodDecideTests for that. We define here "large" as being
in the top 25% of all weights in the given component. For this, we need first
to calculate the 3rd quartile (top 25% threshold). We will show only 10
components:



```{r pcsum3,fig=TRUE,fig.width=8,fig.height=5}
qfnc <- function(r) quantile(r, 0.75)
qqs <- apply(pca$rotation[,1:10], 2, qfnc)
gloadings <- tmodDecideTests(gs, lfc=pca$rotation[,1:10], lfc.thr=qqs)
ggPanelplot(x[1:10], sgenes = gloadings) 
```







Using and creating modules and gene sets
==============================================================

Tmod was created with transcriptional modules in mind. This is why the word
"module" is used throughout tmod. However, any gene or variable set --
depending on application -- is a "module" in tmod. These data sets can be
used with most of tmod functions (including the gene set enrichment test
functions) by specifying it with the option `mset=`, for example
`tmodCERNOtest(..., mset=mytmodobject)`.

Using built-in gene sets (transcriptional modules)
--------------------------------------------------------

By default, tmod uses both the modules published by Li et al.
[@li2014molecular] (LI) and second set of modules published by Chaussabel
et al. [@chaussabel2008modular] (DC). The module definitions for the DC set were described
by Banchereau et al. [@banchereau2012host].

Depending on the `mset` parameter to the test functions, either the LI
or DC sets are used, or both, if the  `mset=all` has been specified.


```{r fiveB}
l    <- tt$GENE_SYMBOL
res2 <- tmodUtest(l, mset="LI")
head( res2 )
```

As you can see, the information contained in both module sets is partially redundant. 




Accessing the tmod gene set data
---------------------------------------

The `tmod` package stores its data in two data frames and two lists.
This object is contained in a list called `tmod`, which is loaded
with `data("tmod")`. The names mimick the various environments from
Annotation.dbi packages, but currently the objects are just two lists and
two data frames.

 * **`tmod$gs`** is a data frame which contains general module
information as defined in the supplementary materials for Li et al.
[@li2014molecular] and Chaussabel et al. [@chaussabel2008modular]
 * **`tmod$gv`** is vector with gene identifiers;
 * **`tmod$gs2gv`** contains the mapping between gene sets and genes.

### Module operations

The gene sets used by tmod are objects of class `tmod`. The default object
used in the gene set enrichment tests in the tmod package can be loaded
into the environment with the command `data(tmod)`:

```{r tmodobj}
data(tmod)
tmod
```

Objects of the class tmod can be easily generated from a number of data
sources (see below). Several functions can be used on the objects:

```{r tmodobj2}
length(tmod)
sel <- grep("Interferon", tmod$gs$Title, ignore.case=TRUE)
ifn <- tmod[sel]
ifn
length(ifn)
```


### Custom module definitions


It is possible to use any kind of arbitrary or custom gene set definitions.
These custom definition of gene sets takes form of a list which is then
provided as the `mset` parameter to the test functions. The list in
question must have the following members:

 * **gs** (gene sets) A data frame which contains at least the columns "ID" and
"Title".
 * **gs2gv** A list. Mapping between gene sets and the gene vector. Each
   element is an *integer* vector which contains the positions of the given
   gene in the *gv* vector. The names of the list correspond to the
   **gs$ID** vector.
 * **gv** Gene vector. Character vector of genes.
   
The tests in the tmod package will accept a simple list that contains the
above fields. However, the function `makeTmodGS` can be used conveniently to
create a tmod object. This function requires only two parameters: `gs` – a
data frame, as described above – and a mapping between gene sets *and gene
identifiers*, in the parameter `gs2gene`. 

Here is a minimal definition of such a set:

```{r exmset}
mymset <- makeTmodGS(
  gs=data.frame(ID=c("A", "B"),
             Title=c("A title", 
                      "B title")),
  gs2gene=list(
    A=c("G1", "G2"),
    B=c("G3", "G4"))
)
mymset
```

Whether the gene IDs are Entrez, or something else entirely does not
matter, as long as they matched the provided input to the test functions.


Obtaining other gene sets
--------------------------------------------------------

The tests in the tmod package can take any arbitrary module definitions.
While tmod -- for many reasons -- cannot distribute all module sets, it can
easily import gene sets from many sources. A few of these will be discussed
below.


### MSigDB

The MSigDB database from the Broad institute is an interesting collection of
gene sets (actually, multiple collections), including Reactome pathways,
gene ontologies (GO) and many other data sets.  Moreover, it is the basis
for the GSEA program.

The whole MSigDB is provided by the
[msigdbr](https://CRAN.R-project.org/package=msigdbr) package from
BioConductor. We can then use tmod function `makeTmodFromDataFrame` to 
convert the msigdbr data frame into one large tmod object:


```{r eval=FALSE}
library(msigdbr)
msig <- msigdbr()
msig <- makeTmodFromDataFrame(df=msig, 
  feature_col="gene_symbol",
  module_col="gs_id", title_col="gs_name", 
  extra_module_cols=c("gs_cat", "gs_subcat", "gs_url", 
                      "gs_exact_source", "gs_description"))
```




Alternatively, you can the MSigDB in XML
format^[Note that even if you register with MSig, it is not possible to
download the database directly from R in the XML format.].
This
file can be accessed at the from
the pages of MSigDB [at the Broad
Institute](http://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2022.1.Hs/msigdb_v2022.1.Hs_chip_files_to_download_locally.zip)
http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/6.1/msigdb_v6.1.xml
-- follow the link, register and log in, and save the zip archive on your disk (roughly 113 MB). 
The ZIP file contains an XML file (called 'msigdb_v2022.1.Hs.xml' at the
time of writing) which you can import into tmod.


Importing MSigDB from XML is easy -- tmod has a function specifically for that
purpose.  Once you have downloaded the MSigDB file, you can create the
tmod-compatible R object with one command^[MSigDB gene sets can be also downloaded as "GMT" files. This format
contains less information and is therefore less usable.]. However, the tmod
function tmodImportMsigDB() can also use this format, look up the manual
page:


```{r msigdbparse1, eval=FALSE}
msig <- tmodImportMSigDB("msigdb_v2022.1.Hs.xml")
```


That's it -- now you can use the full MSigDB for enrichment tests:


```{r msigdb6,eval=FALSE}
res <- tmodCERNOtest(tt$GENE_SYMBOL, mset=msig)
```

The results are quite typical for MSigDB, which is quite abundant with
similar or overlapping gene sets. As the first results, we see, again, interferon
response, as well as sets of genes which are significantly upregulated
after yellow fever vaccination -- and which are also interferon related. We
might want to limit our analysis only to the 50 "hallmark" module
categories:


```{r msigdb7,eval=FALSE}
sel <- msig$gs$gs_cat == "H"
tmodCERNOtest(tt$GENE_SYMBOL, mset=msig[sel] )
```

Other particularly interesting subsets of MSigDB are shown in the table below.
"Category" and "Subcategory" are columns in the `msig$gs` data frame.

Subset       Description                                               Category Subcategory
------------ --------------------------------------------------------- -------- -----------
Hallmark     Curated set of gene sets                                  H
GO / BP      Gene ontology, biological process                         C5       BP
GO / CC      Gene ontology, cellular component                         C5       CC
GO / MF      Gene ontology, molecular function                         C5       MF
Biocarta     Molecular pathways from Biocarta                          C2       CP:BIOCARTA
KEGG         Pathways from Kyoto Encyclopedia of Genes and Genomes     C2       CP:KEGG
Reactome     Pathways from the Reactome pathway database               C2       CP:REACTOME


### Using the ENSEMBL databases through biomaRt

ENSEMBL databases for a multitude of organisms can be accessed using the R package `biomaRt`.  
Importantly, biomaRt allows to map different types of identifiers onto each
other; this allows for example to obtain Entrez gene identifiers (required
by KEGG or GO) .

Below, we will use biomaRt to obtain gene ontology (GO) terms and Reactome
pathway IDs for genes in the `Egambia` data set, using the Entrez gene ID's
(column `EG` in the `Egambia` data set). 

```{r biomart,eval=FALSE}
library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
bm <- getBM(filters="hgnc_symbol", 
            values = Egambia$GENE_SYMBOL,
            attributes = c( "hgnc_symbol", "entrezgene", "reactome", "go_id", "name_1006", "go_linkage_type"),
            mart=mart)
```

In the following code, we construct the modules data frame `m` and the gene
set to gene mappings `m2g` (each twice: once for GO, and once for
Reactome). We only keep the terms that have at least 10 and not more than
100 associated Entrez gene ID's.

```{r biomart2,eval=FALSE}
m2g_r <- with(bm[ bm$reactome != "", ], split(hgnc_symbol, reactome))
m2g_g <- with(bm[ bm$go_id != "", ], split(hgnc_symbol, go_id))

ll <- lengths(m2g_r)
m2g_r <- m2g_r[ ll >= 5 & ll <= 250 ]
ll <- lengths(m2g_g)
m2g_g <- m2g_g[ ll >= 5 & ll <= 250 ]

m_r <- data.frame(ID=names(m2g_r), Title=names(m2g_r))
m_g <- data.frame(ID=names(m2g_g), 
  Title=bm$name_1006[ match(names(m2g_g), bm$go_id)])

ensemblR  <- makeTmod(modules=m_r, modules2genes=m2g_r)
ensemblGO <- makeTmod(modules=m_g, modules2genes=m2g_g)

## these objects are no longer necessary
rm(bm, m_g, m_r, m2g_r, m2g_g)
```


### Gene ontologies (GO)

GO terms are perhaps the most frequently used type of gene sets in GSE, in
particular because they are available for a much larger number of organisms
than other gene sets (like KEGG pathways). 

There are many sources to obtain GO definitions.  As described in the
previous sections, GO's can be also obtained from ENSEMBL via biomaRt and
from MSigDB. In fact, MSigDB may be the easiest way.  

However, GO annotations can also be obtained from
AnnotationDBI Bioconductor packages as shown below. Note that the Entrez
gene IDs are in the `EG` column of the `Egambia` object.

```{r orghs,eval=FALSE}
library(org.Hs.eg.db)
mtab <- toTable(org.Hs.egGO)
```

There are over 15,000 GO terms and 250,000 genes in the `mtab` mapping;
however, many of them map to either a very small or a very large number of
genes. At this stage, it could also be useful to remove any genes not
present in our particular data set, but that would make the resulting tmod
object less flexible. However, we may be interested only in the "biological
process" ontology for now.

```{r orghs2,eval=FALSE}
mtab <- mtab[ mtab$Ontology == "BP", ]
m2g <- split(mtab$gene_id, mtab$go_id)
## remove the rather large object
rm(mtab) 
ll <- lengths(m2g)
m2g <- m2g[ ll >= 10 & ll <= 100 ]
length(m2g)
```

Using the mapping and the `GO.db` it is easy to create a module set
suitable for tmod:

```{r orghs3,eval=FALSE}
library(GO.db)
gt <- toTable(GOTERM)
m <- data.frame(ID=names(m2g))
m$Title <- gt$Term[ match(m$ID, gt$go_id) ]

goset <- makeTmod(modules=m, modules2genes=m2g)
rm(gt, m2g, m)
```

This approach allows an offline mapping to GO terms, assuming the necessary
DBI is installed.  Using AnnotationDBI databases such as `org.Hs.eg.db` has,
however, also two major disadvantages: firstly, the annotations are available
for a small number of organisms. Secondly, the annotations in ENSEMBL may
be more up to date.


### KEGG pathways

One way to obtain KEGG pathway gene sets is to use the MSigDB as described
above. However, alternatively and for other organisms it is possible to
directly obtain the pathway definitions from KEGG. The code below might
take a lot of time on a slow connection.

```{r kegg, eval=FALSE}
library(KEGGREST)
pathways <- keggLink("pathway", "hsa")

## get pathway Names in addition to IDs
paths    <- sapply(unique(pathways), function(p) keggGet(p)[[1]]$NAME)
m <- data.frame(ID=unique(pathways), Title=paths)

## m2g is the mapping from modules (pathways) to genes
m2g <- split(names(pathways), pathways)

## kegg object can now be used with tmod
kegg <- makeTmod(modules=m, modules2genes=m2g)
```

Note that KEGG uses a slightly modified version of Entrez identifiers --
each numeric identifier is preceded by a three letter species code (e.g.
"hsa" for humans) followed by a colon:

```{r kegg2, eval=FALSE}
eg <- paste0("hsa:", tt$EG)
tmodCERNOtest(eg, mset="kegg")
```

Again, the most important part is to ensure that the gene identifiers in
the tmod object (`kegg` in this case) correspond to the gene identifiers in
in the ordered list.

### Manual creation of tmod module objects: MSigDB

For the purposes of an example, the code below shows how to parse the XML
MSigDB file using the R package XML. Essentially, this is the same code that
tmodImportMsigDB is using:

```{r msigdb1,eval=FALSE}
library(XML)
foo  <- xmlParse( "msigdb_v2022.1.Hs.xml" )
foo2 <- xmlToList(foo)
```


There are over 30,000 "gene sets" (equivalent to modules in tmod) defined.
Each member of foo2 is a named character vector:


```{r msigdb2,eval=FALSE}
path1 <- foo2[[1]]
class(path1)
```


```{r msigdb2b,eval=FALSE}
names(path1)
```

For our example analysis, we will use only human gene sets. We further need
to make sure there are no NULLs in the list.


```{r msigdb3,eval=FALSE}
orgs <- sapply(foo2, function(x) x["ORGANISM"])
unique(orgs)

foo3 <- foo2[ orgs == "Homo sapiens" ]
foo3 <- foo3[ ! sapply(foo3, is.null) ]
```


Next, construct the modules data frame. We will use four named fields for each
vector, which contain the ID (systematic name), description, category and
subcategory:


```{r msigdb4,eval=FALSE}
modules <- t(sapply(foo3, 
  function(x) 
    x[ c("SYSTEMATIC_NAME", "STANDARD_NAME", "CATEGORY_CODE", "SUBCATEGORY_CODE") ])) 
colnames(modules) <- c( "ID", "Title", "Category", "Subcategory" )
modules <- data.frame(modules, stringsAsFactors=FALSE)
nrow(modules)
```


Then, we create the mapping from gene sets to genes. For
this, we use the `MEMBERS_SYMBOLIZED` field, which is a comma separated
list of gene symbols belonging to a particular module:


```{r msigdb5,eval=FALSE}
m2g <- lapply(foo3, 
  function(x) strsplit( x["MEMBERS_SYMBOLIZED"], "," )[[1]])
names(m2g) <- modules$ID

mymsig <- makeTmod(modules=modules, modules2genes=m2g)
mymsig
```

From now on, you can use the object mymsig with tmod enrichment tests.



Citing
====================

To cite tmod, please use the following reference:

Zyla J, Marczyk M, Domaszewska T, Kaufmann SH, Polanska J, Weiner 3rd J.
Gene set enrichment for reproducible science: comparison of CERNO and eight
other algorithms. Bioinformatics. 2019 Dec 15;35(24):5146-54.



References
================
