---
title: "Using simphony to evaluate rhythm detection"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using simphony to evaluate rhythm detection}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(message = FALSE, collapse = TRUE, comment = '#>', fig.retina = 2)
```

`simphony` is a framework for simulating rhythmic data, especially gene expression data. Here we show an example of using it to benchmark a method for detecting rhythmicity.

## Load the packages we'll use

Internally, `simphony` uses the `data.table` package, which provides an enhanced version of the standard R `data.frame`. We'll use `data.table` for this example as well.
```{r}
library('data.table')
library('ggplot2')
library('kableExtra')
library('knitr')
library('limma')
library('precrec')
library('simphony')
```

## Simulate the data

Here we create a `data.table` called `featureGroups` that specifies the desired properties of the simulated genes. We want 75% of simulated genes to be non-rhythmic and 25% to have a rhythm amplitude of 1.1. Properties not specified in `featureGroups` will be given their default values.

Our simulated experiment will have 200 genes. Expression values will be sampled from the negative binomial family, which models read counts from next-generation sequencing data. The interval between time points will be 2 (default period of 24), with one replicate per time point. We also use the default time range of our simulated data points of between 0 and 48 hours.
```{r}
set.seed(44)
featureGroups = data.table(fracFeatures = c(0.75, 0.25), amp = c(0, 0.3))
simData = simphony(featureGroups, nFeatures = 200, interval = 2, nReps = 1, family = 'negbinom')
```

The output of `simphony` has three components: `abundData`, `sampleMetadata`, and `featureMetadata`. `abundData` is a matrix that contains the simulated expression values. Each row of corresponds to a gene, each column corresponds to a sample. Since we sampled from the negative binomial family, all expression values are integers.
```{r}
kable(simData$abundData[1:3, 1:3])
```

`sampleMetadata` is a `data.table` that contains the condition (`cond`) and time for each sample. Here we simulated one condition, so `cond` is 1 for all samples.
```{r}
kable(simData$sampleMetadata[1:3])
```

`featureMetadata` is a `data.table` that contains the properties of each simulated gene in each condition. The `group` column corresponds to the row in `featureGroups` to which the gene belongs.
```{r}
kable(simData$featureMetadata[149:151, !'dispFunc']) %>%
  kable_styling(font_size = 12)
```

## Plot the simulated time-course for selected genes

Here we plot the simulated time-course for a non-rhythmic gene and a rhythmic gene. We use the `mergeSimData` function to merge the expression values, the sample metadata, and the gene metadata.
```{r}
fmExample = simData$featureMetadata[feature %in% c('feature_150', 'feature_151')]
dExample = mergeSimData(simData, fmExample$feature)
```

We also want to compare the simulated expression values with their underlying distributions over time, for which we can use the `getExpectedAbund` function. Since we sampled from the negative binomial family, the resulting `mu` column corresponds to the expected log~2~ counts.
```{r}
dExpect = getExpectedAbund(fmExample, 24, times = seq(0, 48, 0.25))
```

Then it all comes together with `ggplot`.
```{r, fig.width = 6, fig.height = 2.75}
dExample[, featureLabel := paste(feature, ifelse(amp0 == 0, '(non-rhythmic)', '(rhythmic)'))]
dExpect[, featureLabel := paste(feature, ifelse(amp0 == 0, '(non-rhythmic)', '(rhythmic)'))]

ggplot(dExample) +
  facet_wrap(~ featureLabel, nrow = 1) +
  geom_line(aes(x = time, y = log2(2^mu + 1)), size = 0.25, data = dExpect) +
  geom_point(aes(x = time, y = log2(abund + 1)), shape = 21, size = 2.5) +
  labs(x = 'Time (h)', y = expression(log[2](counts + 1))) +
  scale_x_continuous(limits = c(0, 48), breaks = seq(0, 48, 8))
```

## Detect rhythmic genes

We can use the `limma` package to detect rhythmic genes based on a linear model that corresponds to cosinor regression. 
```{r}
sampleMetadata = copy(simData$sampleMetadata)
sampleMetadata[, timeCos := cos(time * 2 * pi / 24)]
sampleMetadata[, timeSin := sin(time * 2 * pi / 24)]
design = model.matrix(~ timeCos + timeSin, data = sampleMetadata)
```

Here we follow the typical `limma` workflow: fit the linear model for each gene, run empirical Bayes, and extract the relevant summary statistics. We pass `lmFit` the log~2~ transformed counts.
```{r}
fit = lmFit(log2(simData$abundData + 1), design)
fit = eBayes(fit, trend = TRUE)
rhyLimma = topTable(fit, coef = 2:3, number = Inf)
```

## Evaluate accuracy of rhythmic gene detection

First we merge the results from `limma` with the known amplitudes from `featureMetadata`.
```{r}
rhyLimma$feature = rownames(rhyLimma)
rhyLimma = merge(data.table(rhyLimma), simData$featureMetadata[, .(feature, amp0)], by = 'feature')
```

We can plot the distributions of p-values of rhythmicity for non-rhythmic and rhythmic genes. P-values for non-rhythmic genes are uniformly distributed between 0 and 1, as they should be under the null hypothesis. P-values for rhythmic genes, on the other hand, tend to be closer to 0.
```{r, fig.width = 3.5, fig.height = 3}
ggplot(rhyLimma) +
  geom_jitter(aes(x = factor(amp0), y = P.Value), shape = 21, width = 0.2) +
  labs(x = expression('Rhythm amplitude ' * (log[2] ~ counts)), y = 'P-value of rhythmicity')
```

Finally, we can summarize the ability to distinguish non-rhythmic and rhythmic genes using a receiver operating characteristic (ROC) curve (here we use the `precrec` package).
```{r, fig.width = 3, fig.height = 3}
rocprc = evalmod(scores = -log(rhyLimma$P.Value), labels = rhyLimma$amp0 > 0)
autoplot(rocprc, 'ROC')
```