---
title: "Application on toy example"
author: "Alexis Vandenbon"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Application on toy example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = 'center'
)

library(ggplot2)
theme_set(cowplot::theme_cowplot())
```

# Application on a toy dataset

A small toy dataset is included in the package. The toy dataset includes:

-   `dat.expression`: a toy scRNA-seq dataset with genes (rows) and cells (columns)

-   `dat.tsne`: 2D coordinates of the cells in a t-SNE splot

First, let's apply `haystack` (the main function of the package) on the toy dataset. This should take just several seconds on a typical desktop computer.

```{r example1}
library(singleCellHaystack)
set.seed(1234)

# run the main 'haystack' analysis
# inputs are:
# 1) the coordinates of the cells in the input space (here: dat.tsne)
# 2) the expression data (dat.expression)
res <- haystack(dat.tsne, dat.expression)

# the returned results 'res' is of class 'haystack'
class(res)
```

Let's have a look at the most significant differentially expressed genes (DEGs).

```{r example2}
# show top 10 DEGs
show_result_haystack(res.haystack = res, n=10)

# alternatively: use a p-value threshold
#show_result_haystack(res.haystack = res, p.value.threshold = 1e-10)
```

One of the most significant DEGs is "gene_497". Here we visualize its expression in the t-SNE plot. As you can see, this DEG is expressed only in cells in the upper-left corner of the plot.

```{r}
d <- cbind(dat.tsne, t(dat.expression))
d[1:4, 1:4]
```

```{r fig.width=6, fig.height=4}
library(ggplot2)

ggplot(d, aes(tSNE1, tSNE2, color=gene_497)) +
  geom_point() +
  scale_color_distiller(palette="Spectral")
```

Yes, the coordinates of the cells in this toy example t-SNE space roughly resemble a haystack; see [the Haystack paintings by Monet](https://en.wikipedia.org/wiki/Haystacks_(Monet_series)).

# Clustering and visualization

You are not limited to single genes. Here, we pick up a set of DEGs, and group them by their expression pattern in the plot into 5 clusters.

```{r example4}
# get the top most significant genes, and cluster them by their distribution pattern in the 2D plot
sorted.table <- show_result_haystack(res.haystack = res, p.value.threshold = 1e-10)
gene.subset <- row.names(sorted.table)

# k-means clustering
#km <- kmeans_haystack(dat.tsne, dat.expression[gene.subset, ], grid.coordinates=res$info$grid.coordinates, k=5)
#km.clusters <- km$cluster

# alternatively: hierarchical clustering
hm <- hclust_haystack(dat.tsne, dat.expression[gene.subset, ], grid.coordinates=res$info$grid.coordinates)
```

... and visualize the pattern of the selected genes.

```{r fig.width=6, fig.height=8}
ComplexHeatmap::Heatmap(dat.expression[gene.subset, ], show_column_names=FALSE, cluster_rows=hm, name="expression")
```

We divide the genes into clusters with cutree.

```{r}
hm.clusters <- cutree(hm, k=4)
table(hm.clusters)
```

Then calculate the average expression of the genes in each cluster.

```{r}
for (cluster in unique(hm.clusters)) {
  d[[paste0("cluster_", cluster)]] <- colMeans(dat.expression[names(which(hm.clusters == cluster)), ])
}
```

```{r fig.width=8, fig.height=6}
lapply(c("cluster_1", "cluster_2", "cluster_3", "cluster_4"), function(cluster) {
  ggplot(d, aes(tSNE1, tSNE2, color=.data[[cluster]])) +
  geom_point() +
  scale_color_distiller(palette="Spectral")
}) |> patchwork::wrap_plots()
```

From this plot we can see that genes in each cluster are expressed in different subsets of cells.
