---
title: "Clustering Tracks with CelltrackR"
author: "Inge Wortel"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Clustering}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
matrix_check <- ( packageVersion( "Matrix" ) > "1.6.1" )
irlba_check <- ( packageVersion( "irlba" ) <= "2.3.5.1" )

matrix_irlba_clash <- TRUE #( matrix_check & irlba_check )
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 4.5,
  fig.height = 3
)
```

```{r savepar, echo = FALSE}
# Save current par() settings
oldpar <- par( no.readonly =TRUE )
```

# Introduction

To group tracks with similar properties, one can in principle perform any clustering method of interest on a *feature matrix* of quantification metrics for each track in the dataset. The package comes with three convenience functions -- `getFeatureMatrix()`,`trackFeatureMap()`, and `clusterTracks()` -- to easily compute several metrics on all tracks at once, visualize them in 2 dimensions, and to cluster tracks accordingly. This tutorial shows how to use these functions to explore heterogeneity in a track dataset.


# Datasets

First load the package:

```{r pack, warning = FALSE, message = FALSE}
library( celltrackR )
```

The package contains a dataset of B and T cells in a mouse cervical lymph node, and neutrophils responding to an *S. aureus* infection in a mouse ear; all are imaged using two-photon microscopy. While the original data contained 3D coordinates, we'll use the 2D projection on the XY plane (see the vignettes on [quality control methods](./QC.html) and [preprocessing of the package datasets](./data-QC.html) for details). 

The T-cell dataset consists of 199 tracks of individual cells in a tracks object:

```{r Tdata}
str( TCells, list.len = 2 )
```

Each element in this list is a track from a single cell, consisting of a matrix with $(x,y)$ coordinates and the corresponding measurement timepoints:

```{r showTdata}
head( TCells[[1]] )
```

Similarly, we will also use the `BCells` and `Neutrophils` data:

```{r bdata}
str( BCells, list.len = 2 )
str( Neutrophils, list.len = 2 )
```

Since there are quite many cells, we'll sample just some of the tracks for the legibility of the plots in this tutorial -- but everything we will do could also be done on the complete datasets.

```{r sampleData}
# Take a sample
set.seed(1234)
TCells <- TCells[ sample( names(TCells), 30 ) ]
BCells <- BCells[ sample( names(BCells), 30 ) ]
Neutrophils <- Neutrophils[ sample( names(Neutrophils), 30 ) ]
```


Combine them in a single dataset, where labels also indicate celltype:

```{r combineData}
T2 <- TCells
names(T2) <- paste0( "T", names(T2) )
tlab <- rep( "T", length(T2) )

B2 <- BCells
names(B2) <- paste0( "B", names(B2) )
blab <- rep( "B", length(B2) )

N2 <- Neutrophils
names(N2) <- paste0( "N", names(Neutrophils) )
nlab <- rep( "N", length( N2) )

all.tracks <- c( T2, B2, N2 )
real.celltype <- c( tlab, blab, nlab )

```


# 1 Extracting a feature matrix

Using the function `getFeatureMatrix()`, we can quickly apply several quantification measures to all data at once (see `?TrackMeasures` for an overview of measures we can compute):

```{r featurematrix}
m <- getFeatureMatrix( all.tracks, 
                       c(speed, meanTurningAngle, 
                         outreachRatio, squareDisplacement) )

# We get a matrix with a row per track and one column for each metric:
head(m)
```

We can use this matrix to explore relationships between different metrics. For example, we can check the relationship between speed and mean turning angle:

```{r plotFM, fig.width = 4, fig.height = 3 }
plot( m, xlab = "speed", ylab = "mean turning angle" )
```


# 2 Dimensionality reduction methods: PCA, MDS, and UMAP

When using more than two metrics at once to quantify track properties, it becomes hard to visualize which tracks are similar to each other. Like with single-cell data, dimensionality reduction methods can help visualize high-dimensional track feature datasets. The function `trackFeatureMap()` is a wrapper method that helps to quickly visualize data using three popular methods: principal component analysis (PCA), multidimensional scaling (MDS), and uniform manifold approximate and projection (UMAP). The function `trackFeatureMap()` can be used for a quick visualization of data, or return the coordinates in the new axis system if the argument `return.mapping=TRUE`.

### 1.1 PCA

Use `trackFeatureMap()` to perform a principal component analysis (PCA) based on the measures "speed", "meanTurningAngle", "squareDisplacement", "maxDisplacement", and "outreachRatio" using the optional `labels` argument to color points by their real celltype and `return.mapping=TRUE` to also return the data rather than just the plot:

```{r pca}
pca <- trackFeatureMap( all.tracks, 
               c(speed,meanTurningAngle,squareDisplacement,
                 maxDisplacement,outreachRatio ), method = "PCA", 
               labels = real.celltype, return.mapping = TRUE )
```

Note that the B cells and Neutrophils are relatively well-separated from each other in this plot, but the T cells are hard to distinguish from neutrophils based on these features.

We can then inspect the data stored in `pca`. This reveals, for example, that the first principal component is correlated with speed:

```{r inspectPC}
pc1 <- pca[,1]
pc2 <- pca[,2]
track.speed <- sapply( all.tracks, speed )
cor.test( pc1, track.speed )
```

See `?prcomp` for details on how principal components are computed, and for further arguments that can be passed on to `trackFeatureMap()`.


### 1.2 MDS

Another popular method for visualization is multidimensional scaling (MDS), which is also supported by `trackFeatureMap()`:

```{r mds}
trackFeatureMap( all.tracks,
               c(speed,meanTurningAngle,squareDisplacement,maxDisplacement,
                 outreachRatio ), method = "MDS",
               labels = real.celltype )
```

Internally, `trackFeatureMap()` computes a distance matrix using `dist()` and then applies MDS using `cmdscale()`. See the documentation of `cmdscale` for details and further arguments that can be passed on via `trackFeatureMap()`.

Again, we find that the B cells and Neutrophils are separated in this plot, while T cells mix with neutrophils.

### 1.3 UMAP

Uniform manifold approximate and projection (UMAP) is another powerful method to explore structure in high-dimensional datasets. `trackFeatureMap()` supports visualization of tracks in a UMAP. Note that this option requires the `uwot` package. Please install this package first using `install.packages("uwot")`. 

```{r rspec, echo = FALSE, eval = matrix_irlba_clash}
library( RSpectra )
```


```{r umap}
trackFeatureMap( all.tracks,
        c(speed,meanTurningAngle,squareDisplacement,
          maxDisplacement,outreachRatio ), method = "UMAP",
          labels = real.celltype )
```


# 3 Clustering: hierarchical clustering and k-means

To go beyond visualizing similar and dissimilar tracks using multiple track features, `clusterTracks()` supports the explicit grouping of tracks into clusters using two common methods: hierarchical and k-means clustering.

### 3.1 Hierarchical clustering

Hierarchical clustering is performed by calling `hclust()` on a distance matrix computed via `dist()` on the feature matrix:

```{r hclus, fig.width = 7.5, fig.height = 3}
clusterTracks( all.tracks,
               c(speed,meanTurningAngle,squareDisplacement,maxDisplacement,
                 outreachRatio ), method = "hclust", labels = real.celltype )
```

See methods `dist()` and `hclust()` for details.

### 3.2 K-means clustering

Secondly, `clusterTracks()` also supports k-means clustering of tracks. Note that this requires an extra argument `centers` that is passed on to the `kmeans()` function and specifies the number of clusters to make. In this case, let's use three clusters because we have three celltypes:

```{r kmean, fig.height = 8, fig.width = 7 }
clusterTracks( all.tracks,
               c(speed,meanTurningAngle,squareDisplacement,maxDisplacement,
                 outreachRatio ), method = "kmeans", 
               labels = real.celltype, centers = 3 )
```

In these plots, we see the value of each feature in the feature matrix plotted for the different clusters, whereas the color indicates the "real" celltype the track came from.


```{r resetpar, echo = FALSE}
# Reset par() settings
par(oldpar)
```