---
title: "Identifying and removing the cell-cycle effect from single-cell 
RNA-Sequencing data - the ccRemover package"
author: "Martin Barron, Jun Li"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{sparseDC}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
---  
references:
- id: buettner15
  title: Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells
  author:
  - family: Buettner
    given: Florian
  - family: Natarajan
    given: Kedar N.
  - family: Casale
    given: F. Paolo
  - family: Proserpio
    given: Valentina
  - family: Scialdone
    given: Antonio
  - family: Theis
    given: Fabian J.
  - family: Teichmann
    given: Sarah A.
  - family: Marioni
    given: John C.
  - family: Stegle
    given: Oliver
  container-title: Nature Biotechnology
  volume: 33
  URL: 'http://www.nature.com/nbt/journal/v33/n2/abs/nbt.3102.html'
  DOI: 10.1038/nbt.3102
  issue: 2
  publisher: Nature Publishing Group
  page: 155-160
  type: article-journal
  issued:
    year: 2015
    month: 2
    
- id: mahata14
  title: Single-Cell RNA Sequencing Reveals T Helper Cells Synthesizing Steroids De Novo to Contribute to Immune Homeostasis
  author:
  - family: Mahata
    given: Bidesh
  - family: Zhang
    given: Xiuwei
  - family: Kolodziejczyk
    given: Aleksandra A
  - family: Proserpio
    given: Valentina
  - family: Haim-Vilmovsky
    given: Liora
  - family: Taylor
    given: Angela E
  - family: Hebenstreit
    given: Daniel
  - family: Dingler
    given: Felix A
  - family: Moignard
    given: Victoria
  - family: G?ttgens
    given: Berthold
  - family: Arlt
    given: Wiebke
  - family: McKenzie 
    given: Andrew NJ
  - family: Teichmann
    given: Sarah A.
  container-title: Cell Reports
  volume: 7
  URL: 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4039991/'
  DOI: 10.1016/j.celrep.2014.04.011
  issue: 4
  page: 1130-1142
  type: article-journal
  issued:
    year: 2014
    month: 5
    
- id: barron16
  title: Identifying and removing the cell-cycle effect from single-cell RNA-Sequencing data
  author:
  - family: Barron
    given: Martin
  - family: Li
    given: Jun
 
  container-title: Scientific Reports
  volume: 6
  URL: 'https://www.nature.com/articles/srep33892'
  DOI: 10.1038/srep33892
  type: article-journal
  issued:
    year: 2016
    month: 9
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ccRemover)
set.seed(10)
```

**Abstract**

Measurements of expression in single-cell RNA-Sequencing often suffer from a 
large systematic bias. A major source of this bias is the cell-cycle which 
introduces within-cell-type heterogeneity that can obscure the differences 
in expression between cell-types. The package *ccRemover* provides a method 
for identifying and removing the effects of the cell-cycle from single-cell
RNA-Seq data. This vignette explains the use of the ccRemover r-package which
implements the ccRemover algorithm Barron and Li [-@barron16].



## 1 Preliminaries

---


### 1.1  Normalized Data Matrix

As input the ccRemover package expects a normalized, log-transformed and 
centered matrix of gene expression measurements for single cells obtained
from single-cell RNA-Seq (scRNA-Seq) or another high-throughput sequencing 
experiment. The genes should correspond to the rows and the cells to the 
columns, that is the measurement in the $i$-th row and the $j$-th column 
of the matrix contains the expression of gene $i$ for cell $j$. 

It is important that the data is normalized for sequencing depth prior to
analysis or this may interfere with the identification of the cell-cycle 
components. From experience we have found that log-transforming the data 
prior to analysis leads to better results and more accurate cell-cycle removal.
The data should be centered on a gene-by-gene basis, this prevents any small
set of genes from dominating the principal components used to capture the 
cell-cycle effect. 

To demonstrate the preprocessing steps and the application of ccRemover we 
will use the differentiating T-helper cell data set that is presented as the
first example in the original paper and was retrieved from the supplementary
material of Buettner et al. [-@buettner15]. It was originally generated by 
Mahata B. et al. [-@mahata14] This data set contains normalized and log 
transformed gene expression measurements for 81 cells and 14,147 genes. 
First we must load the data into R.

```{r load data1}
data(t.cell_data)
```

To check the data is in the appropriate format we can view the file with a 
text editor or use r. 

```{r data}
head(t.cell_data[,1:5])
```

Here the data has genes as rows and cells as columns which is as required
for ccRemover. If you have data which is in the opposite format then it 
must be transposed prior to applying ccRemover. This data is also already
normalized and log-transformed. Next we must check that the data has been
centered on a gene-by-gene basis. This is easy to see from a summary of 
the row means.

```{r center check}
summary(apply(t.cell_data,1, mean))
```

This data is not centered as can be seen from the non-zero row means above.
We will center this data on a gene-by-gene basis and then recheck the row 
means to ensure the data has been centered correctly. The mean values will also
be stored so that they can be added back into the cleaned data matrix

```{r Center the data}
mean_gene_exp <- rowMeans(t.cell_data)
t_cell_data_cen <- t.cell_data - mean_gene_exp
summary(apply(t_cell_data_cen,1,mean))

```

From the summary of the row means we can see that the centering has been 
effective and the genes now have mean 0 (or very close to it). 

## 1.2  The cell-cycle genes
In order to remove the cell-cycle effect from the data ccRemover uses sets
of genes annotated to the cell-cycle and not annotated to the cell-cycle. 
ccRemover uses the set of genes which are not annotated to the cell-cycle 
to identify the main effects that are present in the data. It then uses 
the cell-cycle annotated genes to identify which of the effects are related
to the cell-cycle before removing them. 

These genes can be identified using your own methods or using ccRemover's 
built-in cell-cycle gene identifier. This information is entered into ccRemover
as a vector of length $n$, 'if_cc', where $n$ is the number of genes in 
the data matrix and 'if_cc[i] == TRUE' if gene $i$ is a gene annotated 
to the cell-cycle or 'if_cc[i] == FALSE' if gene $i$ is not annotated 
to the cell-cycle. In other words 'if_cc' is an indicator vector indicating
whether a gene is a cell-cycle gene or not.

If you are using ccRemover to remove another source of unwanted variation in 
the data that is not the cell-cycle then you will need to determine yourself 
which genes are effected by the feature which you are interested in removing.
Thus the vector will have entries 'if_cc[i] == TRUE' if gene $i$ is affected 
by the feature of interest and 'if_cc[i] == FALSE' if gene $i$ is not affected
by the feature of interest. 

To identify the cell-cycle genes in your own data set using ccRemovers built 
in function we recommend following the procedure below. First extract the gene
names from the data matrix:

```{r Extract Gene Names}
gene_names <- rownames(t_cell_data_cen)
```

Next we use ccRemovers built-in function to identify the indices of the 
cell-cycle genes from the data. This function requires the additional arguments
'species' and 'name_type', these arguments correspond to the species
of the samples and the format of the gene ID's respectively. Currently 
ccRemover is able to identify genes from Homo Sapiens ('"human"')
and Mus Musculus ('"mouse"'). In addition it is able to function using
Ensembl Gene IDs ('"ensembl"'), HGNC/MGI symbols ('"symbol"'), Entrez
Gene IDs ('"entrez"') and Unigene IDs ('"unigene"'). If your gene 
ID's are not in one of these formats then they must be converted prior to using
ccRemover's built-in gene indexer, Biomart provides a great data repository and
user interface from which it is possible to download the data necessary for
Gene ID conversion. If your names are of a different type but you have 
identified your own set of cell-cycle related genes and just need to use the
main ccRemover function then ccRemover only needs to be provided with the
vector described above and no ID conversion is necessary.

This function identifies which of the gene names are annotated to the cell-cycle
and returns a list of the indices of the cell-cycle genes. As the samples in 
this data set are from a mouse and the gene names are MGI symbols we select 
'species = "mouse"' and 'name_type = "symbols"'. 

```{r Identify cell-cycle genes}
cell_cycle_gene_indices <- gene_indexer(gene_names, species = "mouse", 
                                        name_type = "symbols" )
length(cell_cycle_gene_indices)
```

Here we have identified 751 of the 7,073 genes as cell-cycle related genes. 

Finally we must create the vector which will be used in the main ccRemover 
procedure. This step should be followed if you have calculated your own gene 
indices as well. Just sub in your own set of indices in place of 
'cell_cycle_gene_indices' in the code below. We first create a vector of 
'FALSE' values of length 'n' to store the values and then change
the entries corresponding to the cell-cycle genes to 'TRUE'. Finally we
summarize the vector to ensure the correct number of cell-cycle genes has been
assigned.

```{r cell-cycle vector}
if_cc <- rep(FALSE,nrow(t_cell_data_cen)) 
if_cc[cell_cycle_gene_indices] <- TRUE
summary(if_cc)
```

Here we see that there are 751 values of 'TRUE' in the 'if_cc' vector and 
6,322 values of 'FALSE' as expected.

## 1.3  Putting it Together

Now that we have our normalized, centered and log-transformed data matrix
along with our true/false vector indicating which genes are cell-cycle genes
we must put them into a list prior to applying the main ccRemover function.
The data matrix in the format described above should be saved as 'x' and the
'if_cc' vector should be saved as 'if_cc'

```{r data list1}
dat <- list(x=t_cell_data_cen, if_cc=if_cc)

```

Now we are ready to remove the cell-cycle effects from the data set.

# 2  ccRemover

---

## 2.1  Applying ccRemover

Using the data list which has been created in the previous step we can now
run ccRemover. Using the 'if_cc' vector ccRemover splits the data into two
sets, the cell-cycle annotated genes and the genes not annotated to the 
cell-cycle, which we call the control genes. ccRemover identifies the main 
components of variation in the control genes using a simple principal 
components analysis. These are then compared with the cell-cycle genes 
to identify which of them are cell-cycle related. Finally the effect of
the components identified as cell-cycle related is removed from the data.
Please refer to the original manuscript for a detailed explanation of the
method and examples of its application. 

As ccRemover runs it will print a table displaying the loadings of each
component on the control genes (xn.load), the cell-cycle genes (xy.load),
the difference between the two (diff.load) and the t-score for the difference
in loading (t.load.boot) for each iteration. Those components which have a
t-score greater than the cutoff (default value = 3) will be classified as 
cell-cycle effects and removed from the data.

For this vigenette we turn off the progress bar, tracking the bootstrap
repitions, as it does not work well with R-markdown.

```{r ccRemover}
xhat <- ccRemover(dat, bar=FALSE)

```

For this data set we see that ccRemover identifies the first principal 
component as a cell-cycle effect on its first iteration (t.load.boot = 7.13).
Once this component has been removed ccRemover does not identify any other 
cell-cycle effects present in the data on the second iteration. 
The ccRemover function outputs a transformed data matrix, 'xhat' which is 
free of cell-cycle effects. Further analysis can then be carried out on this
data set with the confounding effects of the cell-cycle present. This can 
greatly improve the performance of clustering algorithms on the data.

The final step here is to add the mean values back to the cleaned data matrix:

```{r add mean}
xhat <- xhat + mean_gene_exp

```


## 2.2  Settings 

If you choose to run ccRemover not using the default settings the following
options are available to you:

```{r ccRemover with settings}
xhat <- ccRemover(dat, cutoff = 3, max_it = 4, nboot = 200, ntop = 10, bar=FALSE)
```
 
 - The 'cutoff' is used to determine which of the effects are cell-cycle effects.
 The default and recommended value is 3, which roughly corresponds to a p-value 
 of 0.01. For data sets which have very low levels of cell-cycle activity this 
 value can be lowered to increase the detection of cell-cycle effects. Example 3 in the original manuscript was a case where a lower value of the cutoff was necessary. 

  - The 'max.it' value is the maximum number of iterations of the method.
  ccRemover will stop whenever it detects no more significant effects
  present in the data or it reaches its maximum number of iterations. 
  The default value is 4 but we have found that for many data sets the 
  cell-cycle effect will be effectively removed after 1 or 2 iterations.

  - The 'nboot' value corresponds to the number of bootstrap repetitions
  carried out to test the significance of the components. Please refer to
  the methods section original manuscript for a detailed description of 
  the estimation process. The default value is 200 and we have found this
  work effectively for most data sets. 

  - The 'ntop' parameter determines the number of principal components
  which are to be tested as cell-cycle effects upon each iteration. We
  have found that the default value of 10 works effectively for most 
  data sets. However, for extremely diverse data sets within which there
  are expected to be many sources of variation this value could be raised
  so that all elements of the cell-cycle effect can be identified and removed. 


# References














