---
title: "treestructure applied to structured coalescent simulation"
author: "Erik Volz"
date: "`r Sys.Date()`"
output: 
  bookdown::html_vignette2:
  #rmarkdown::html_vignette
  #bookdown::pdf_book:
    toc: TRUE
pkgdown:
  as_is: true
fontsize: 12pt
vignette: >
  %\VignetteIndexEntry{treestructure applied to structured coalescent simulation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 11,
  warning = FALSE, 
  message = FALSE
)
```


# Structured coalescent simulation 

This example shows the function `trestruct` applied to a simulated structured 
coalescent tree that includes samples from a large constant size population and 
samples from three small 'outbreaks' which are growing exponentially. 
These simulations were generated with the [phydynR package](https://emvolz-phylodynamics.github.io/phydynR/). 

```{r message=FALSE}
library(ape)
library(treestructure)
library(ggtree)
```

Load the tree: 

```{r}
( tree <- ape::read.tree( system.file('sim.nwk', package = 'treestructure') ) ) 
```


Note that the tip labels corresponds to the deme of each sample. 
'1' is the constant size reservoir, and '0' is the exponentially growing deme. 


This will run the `treestructure` algorithm under default setting: 
```{r message=FALSE}
s <- trestruct( tree ) 
```

You can print the results: 

```{r}
print(s) 
```

## Plotting results

The default plotting behavior uses the `ggtree` package if available. 
```{r message=FALSE}
plot(s)  + ggtree::geom_tiplab() 
```

If not, or if desired, `ape` plots are available
```{r}
plot( s, use_ggtree = FALSE )
```


For subsequent analysis, you may want to turn the `treestructure` result into a 
dataframe: 
```{r}
structureData <- as.data.frame( s ) 
head( structureData )
```

Each cluster and partition assignment is stored as a factor. You could use `split` 
to get a data frame for each partition. 
Suppose we want a tree corresponding to partition 1: 
```{r}
with ( structureData, 
       ape::keep.tip(s$tree, taxon[ partition==1 ] ) 
       ) -> partition1
partition1
plot(partition1)
```


## Parameter choice and number of clusters

Two parameters will have large influence on results: 

1. `level` is the significance level for subdividing a clade into a new cluster. 
To detect more clusters, increase `p`, but note that this will also increase the 
false positive rate. 
2. `minCladeSize` controls the smallest allowed cluster size in terms of the 
number of tips. With a smaller value, smaller clusters may be detected, but 
computation time will increase. 

Example: 
```{r message=FALSE}
trestruct( tree, level = .05, minCladeSize = 5 )
```

In practice, clustering thresholds are always subjective and the best value of 
the `level` parameter will depend on your application. 
One way to choose an appropriate `level` would be to use additional data associated 
with each sample. You can select the `level` which gives a set of clusters that 
explains the most variance in the data of interest (e.g. use the cluster as a 
factor in an ANOVA). 

### CH index

Alternatively, in the absence of any additional data, the `treestructure` package 
supports using the [CH index](https://en.wikipedia.org/wiki/Calinski–Harabasz_index) 
to compare different `level`s. This statistic is based on the ratio of the 
between-cluster and within-cluster variance of the time of each node 
(distance from the root) and returns the `level` such that this ratio is maximized. 
If you wish to use the CH index, pass `level = NULL` to `trestruct`, and 
read documentation for the `levellb`, `levelub`, and `res` parameters.