---
title: "Analysis of single-cell RNA-seq data using a topic model, Part 1: basic concepts"
author: Peter Carbonetto
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float: no
    highlight: textmate
    theme: readable
vignette: >
  %\VignetteIndexEntry{Analysis of single-cell RNA-seq data using a topic model, Part 1: basic concepts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The aim of this vignette is to introduce the basic concepts behind an
analysis of single-cell RNA-seq data using a topic model, and to show
how to use 'fastTopics' to implement this analysis. This first
vignette explains the analysis steps at only a high level---see
[Part 2][vignette-part-2] for additional explanations and
guidance.

Since a topic model analysis is quite different from most conventional
analyses of single-cell RNA-seq data, we point out key
differences.

One important difference is that a topic model is a model of count
data, so *the topic model should be applied directly to the count
data.* In contrast, many methods require preprocessing of the count
data.

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
                      fig.align = "center",dpi = 120)
```

We begin our analysis by loading the packages. Then we set the seed so
that the results can be reproduced.

```{r load-pkgs, message=FALSE, warning=FALSE}
library(Matrix)
library(fastTopics)
library(ggplot2)
library(cowplot)
set.seed(1)
```

The example data set
--------------------

We will illustrate the concepts using a single-cell RNA-seq data set
from [Zheng *et al* (2017)][zheng-2017]. These data are reference
transcriptome profiles from 10 bead-enriched subpopulations of
peripheral blood mononuclear cells (PBMCs). The original data set is
much larger---for this introduction, we have taken a small subset of
approximately 3,700 cells.

The data we will analyze are unique molecular identifier (UMI)
counts. These data are stored as an $n \times m$ sparse matrix, where
$n$ is the number of cells and $m$ is the number of genes:

```{r load-data}
data(pbmc_facs)
counts <- pbmc_facs$counts
dim(counts)
```

The UMI counts are "sparse"---that is, most of the counts are
zero. Indeed, over 95% of the UMI counts are zero:

```{r nonzeros}
mean(counts > 0)
```

No need to normalize or transform
---------------------------------

Most analyses of single-cell RNA-seq data involve a preprocessing step
in which the UMI counts are log-transformed and normalized. *When
analyzing UMI counts with a topic model, you should not normalize or
transform the counts.*

Additionally, analyses of single-cell RNA-seq data typically select
only the most highly variable genes. *We recommend instead using all
or most genes.* The exception are genes in which all the UMI counts
are zero---these genes should be removed prior to a topic model
analysis.

Fit the topic model
-------------------

Since no pre-processing is needed, we can move directly to the next
step of the analysis: fitting the topic model to the UMI count
data. This is accomplished with a call to `fit_topic_model`:

```{r fit-topic-model, eval=FALSE}
fit <- fit_topic_model(counts,k = 6)
```

Note it may take several minutes to fit the topic model on this data
set. For convenience, we saved the output from this call:

```{r load-fit}
fit <- pbmc_facs$fit
```

To fit a topic model, we must specify $K$, the number of topics.
Here, we have chosen $K = 6$ topics. In most settings, a good choice
of $K$ will not be known in advance, so you will you want to explore
the results from topic models at different settings of $K$.

The `fit_topic_model` interface is intended to hide the details of
model fitting, and it should work well for many data sets. Larger or
more complex data sets my require some fine-tuning of the model
fitting. See [Part 2][vignette-part-2] for more on this.

Each cell is represented as a unique combination of topics
----------------------------------------------------------

A key feature of the topic model is that each cell $i$ is represented
as a *unique combination of the topics.* Therefore, each cell can be
summarized by the $K$ topic proportions. These cell-specific
proportions are learned from the data. In 'fastTopics', the topic
proportions for all cells are stored in an $n \times K$ matrix:

```{r loadings-1}
dim(fit$L)
```

To illustrate, here is a cell in which the pattern of expression is
almost fully captured by the fourth topic:

```{r loadings-2}
rows <- "GATATATGTCAGTG-1-b_cells"
round(fit$L[rows,],digits = 3)
```

Here are two more examples:

```{r loadings-3}
rows <- c("GACAGTACCTGTGA-1-memory_t",
          "TGAAGCACACAGCT-1-b_cells")
round(fit$L[rows,],digits = 3)
```

The last example is interesting because the observed expression is
best captured by a combination of topics 4 and 6.

To make sense of these results, we first need to understand the
biological relevance of the topics, which we explore next.

Interpreting topics using available cell labels
-----------------------------------------------

In some cases, you may have additional information about the cells,
such as the tissue the cells were sampled from. (For interpreting
topics when cells are not labeled, see [Part 2][vignette-part-2].)  In
the PBMC data, the cells were labeled using fluorescence-activated
cell sorting (FACS). Each cell is assigned one of five labels (each
corresponding to a cell type): B cells, CD14+ monocytes, CD34+ cells,
natural killer (NK) cells, and T cells.

```{r summary-subpop}
samples <- pbmc_facs$samples
summary(samples$subpop)
```

Create a "Structure plot" to visualize the relationship between the
cell labels and the topic proportions:

```{r structure-plot-with-celltype-labels, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE}
topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue",
                  "gold","darkorange")
structure_plot(fit,colors = topic_colors,topics = 1:6,gap = 25,
               grouping = samples$subpop)
```

The Structure plot is a stacked bar chart in which each topic is
represented as a bar of a different colour, and the bar heights are
the topic proportions. Patterns begin to emerge when the cells are
arranged so that cells with similar topic proportions are positioned
close to each other. This arrangement is automated in `structure_plot`.

From the Structure plot, we see that topics 1 (light blue), 2 (green)
and 4 (dark blue) closely correspond to NK, CD14+ and B cell
types, respectively; expression in these cells is largely
explained by a single topic.

The topic model also captures interesting substructure within the T
cells: most T cells are explained by at least two topics, with wide
variation in the proportions for these two topics; also, there is a
subset of T cells that are explained by three topics (topics 1, 5 and
6).

Predicting topics in other cells
--------------------------------

Having fitted a topic model to a single-cell data set, we can use that
model to predict topics for cells that were not used to train the
model. We illustrate this with a separate collection of 100 cells
drawn from the same PBMC data set:

```{r test-data}
counts_test <- pbmc_facs$counts_test
dim(counts_test)
```

The "predict" function estimates the topic proportions based on a
previously fit topic model.

```{r predict, results="hide", message=FALSE}
Ltest <- predict(fit,counts_test)
```

The predictions appear to align well with the provided cell labels:

```{r structure-plot-test, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE}
fit_test <- list(F = fit$F,L = Ltest)
class(fit_test) <- c("multinom_topic_model_fit","list")
structure_plot(fit_test,topics = 1:6,colors = topic_colors,gap = 2,
               grouping = pbmc_facs$samples_test$subpop)
```

Topics capture patterns of relative expression
----------------------------------------------

Each topic is represented as a vector of $m$ expression levels which
are stored as an $m \times K$ matrix:

```{r factors-1}
dim(fit$F)
```

These expression levels are comparable across topics. For example,
genes *CD79A* are *CD79B* are important to B cells, so we would expect
higher expression in topic 4:

```{r factors-2}
genes <- pbmc_facs$genes
rbind(fit$F[genes$symbol == "CD79B",],
      fit$F[genes$symbol == "CD79A",])
```

The most informative genes are those with higher expression in one
topic compared to the other topics. In the next section we explain how
to identify these genes.

Annotating topics by differentially expressed genes
---------------------------------------------------

Consider a standard differential expression (DE) analysis of cell
types: for a given cell type, the log-fold change (LFC) is estimated
as the (base-2) log-ratio of two expression levels, the expression in
cells belonging to the cell type over the expression in cells not
belonging to the cell type.

'fastTopics' extends the standard DE analysis to allow for *partial
membership* to multiple groups (the methods are described
[here][single-cell-topics-paper]). This "grade of membership" (GoM)
differential expression analysis is implemented by the
function `de_analysis`:

```{r de-analysis-1, eval=FALSE}
set.seed(1)
de <- de_analysis(fit,counts,pseudocount = 0.1,
                  control = list(ns = 1e4,nc = 4))
```

Since this computation can take 10 minutes or more to run, we have
provided the output of this `de_analysis` call in the `pbmc_facs` data
set. Since the `de_analysis` output is quite large, we have made this
available outside the package:

```{r de-analysis-2}
pbmc_facs_file <- tempfile(fileext = ".RData")
pbmc_facs_url <- "https://stephenslab.github.io/fastTopics/pbmc_facs.RData"
download.file(pbmc_facs_url,pbmc_facs_file)
load(pbmc_facs_file)
de <- pbmc_facs$de
```

The GoM DE analysis generates LFC estimates that have a similar
interpretation to a standard DE analysis. To illustrate, the GoM DE
analysis estimates strong overexpression of B-cell genes *CD79A* and
*CD79B* in topic 4:

```{r diff-count-analysis-3}
rbind(de$postmean[genes$symbol == "CD79A",],
      de$postmean[genes$symbol == "CD79B",])
```

Ultimately, we would like to discover genes most relevant to a topic,
not just report LFC estimates for known genes. A common approach is to
visualize the results of the DE analysis using a volcano plot. For
example, here is the volcano plot for topic 4:

```{r volcano-plot-bcells, fig.height=4, fig.width=4.5}
volcano_plot(de,k = 4,labels = genes$symbol)
```

Typically, the most interesting genes are found on the right-hand side
of the volcano plot---that is, genes with large LFC---so long as they
have a small lfsr (local false sign rate). Indeed, B cell gene *CD79A*
is on the far right-hand side of the plot.

Likewise, natural killer genes such as *NKG7* and *GNLY* emerge on the
right-hand side of the volcano plot for topic 1:

```{r volcano-plot-nk, fig.height=4, fig.width=4.5}
volcano_plot(de,k = 1,labels = genes$symbol,ymax = 100)
```

Some of the genes with the largest LFCs are *CD3D*, *CD3E* and *CD8B*
which suggest topic 6 is capturing expression specific to T cells:

```{r volcano-plot-tcells, fig.height=4, fig.width=4.5}
volcano_plot(de,k = 6,labels = genes$symbol,ymax = 100)
```

See also [Part 2][vignette-part-2] for more DE analysis examples.

Session info
------------

This is the version of R and the packages that were used to generate
these results.

```{r session-info}
sessionInfo()
```

[vignette-part-2]: https://stephenslab.github.io/fastTopics/articles/single_cell_rnaseq_practical.html
[zheng-2017]: https://doi.org/10.1038/ncomms14049
[single-cell-topics-paper]: https://doi.org/10.1101/2023.03.03.531029 
