---
title: "Starting secsse"
author: "Thijs Janzen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Starting secsse}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---
## Secsse introduction

secsse is an R package designed for multistate data sets under a concealed state
and speciation ('hisse') framework. In this sense, it is parallel to the 'MuSSE'
functionality implemented in 'diversitree', but it accounts for finding possible
spurious relationships between traits and diversification rates ('false
positives', Rabosky & Goldberg 2015) by testing against a 'hidden trait'
(Beaulieu et al. 2013), which is responsible for more variation in
diversification rates than the trait being investigated.

### Secsse input files

Similar to the 'diversitree' (Fitzjohn et al. 2012) and 'hisse'
(Beaulieu & O'Meara 2016) packages, secsse uses two input files: a rooted,
ultrametric tree in nexus format (for conversion of other formats to nexus,
we refer to the documentation in package 'ape') and a data file with two
columns,
the first containing taxa names and the second a numeric code for trait state
with a header (usually 0, 1, 2, 3, etc., but notice that 'NA' is a valid code
too, if you are not sure what trait state to assign to a taxon). Here, we will 
use a simple trait dataset with only values 0 and 1, indicating presence and
absence of a trait. A comma-separated value file (.csv) generated in MsExcel 
works particularly well. The \*.csv file can be loaded into R using the 
read.csv() function. and should look like this:

```{r}
library(secsse)
data(traits)
tail(traits)
```

This data set (here we see only the bottom lines of the data frame) has two
character states labeled as 0 and 1. Ambiguity about trait state (you are not
sure which trait state to assign a taxon too, or you have no data on trait state
for a particular taxon), can be assigned using 'NA'. secsse handles 'NA'
differently from a full trait state, in that it assigns probabilities to all
trait states for a taxon demarcated with 'NA'.

The second object we need is an ultrametric phylogenetic tree, that is rooted
and has labelled tips. One can load it in R by using read.nexus(). In our
example we load a prepared phylogeny named "phylo_vignette":

```{r}
data("phylo_vignette")
```

For running secsse it is important that tree tip labels agree with taxon names
in the data file, but also that these are in the same order. For this purpose,
we run the following piece of code prior to any analysis:

```{r}
sorted_traits <- sortingtraits(traits, phylo_vignette)
```

If there is a mismatch in the number of taxa between data and tree file, you
will receive an error message. However, to then identify which taxa are causing
issues and if they are in the tree or data file, you can use the name.check
function in the 'geiger'(Harmon et al. 2008) package:

```{r}
library(geiger)
#pick out all elements that do not agree between tree and data
mismat <- name.check(phylo_vignette, traits)
#this will call all taxa that are in the tree, but not the data file
#mismat$tree_not_data
#and conversely,
#mismat$data_not_tree
```

If you have taxa in your tree file that do not appear in your trait file, it is
worth adding them with value `NA` for trait state. 
You can visualise the tip states using the package diversitree:

```{r plot_tree}
if (requireNamespace("diversitree")) {
  for_plot <- data.frame(trait = traits$trait,
                         row.names = phylo_vignette$tip.label)
diversitree::trait.plot(phylo_vignette, dat = for_plot,
                        cols = list("trait" = c("blue", "red")),
                        type = "p")
}

```

After you are done properly setting up your data, you can proceed to setting
parameters and constraints.

#### Note on assigning ambiguity to taxon trait states

If the user wishes to assign a taxon to multiple trait states, because he/she is
unsure which state best describes the taxon, he/she can use `NA`. `NA` is used
when there is no information on possible state at all; for example when a state
was not measured or a taxon is unavailable for inspection. `NA` means a taxon is
equally likely to pertain to any state. In case the user does have some
information, for example if a taxon can pertain to multiple states, or if there
is uncertainty regarding state but one or multiple states can with certainty be
excluded, secsse offers flexibility to handle ambiguity. In this case, the user
only needs to supply a trait file, with at least four columns, one for the taxon
name, and three for trait state. Below, we show an example of what the trait
info should be like (the column with species' names has been removed). If a
taxon may pertain to trait state 1 or 3, but not to 2, the three columns should 
have at least the values 1 and a 3, but never 2 (species in the third row). On
the other hand, the species in the fifth row can pertain to all states: the
first column would have a 1, the second a 2, the third a 3 (although if you only
have this type of ambiguity, it is easier to assign `NA` and use a single-column
data file).

```{r}
#       traits traits traits
# [1,]      2      2      2
# [2,]      1      1      1
# [3,]      2      2      2
# [4,]      3      1      1
# [5,]      1      2      3
```


## Setting up an analysis

To perform a Maximum Likelihood analysis, secsse makes use of the
function `DDD::optimize()`, which in turn, typically, uses the subplex
package to perform the Maximum Likelihood optimization. In such an
analysis, we need to specify which parameters we want to optimize, which
parameters to keep fix, and the initial values per parameter. We do so
by providing the structure of the input parameters (e.g. in vector,
matrix or list form), and within this structure we highlight values that
stay at zero with a 0, and parameters to be inferred with indexes 1, 2,
... n. The optimizer will then use these indexes to fill in the
associated parameters and perform the optimization. If this all seems a
bit unclear, please continue reading and look at the fully set up
parameterization for the maximum likelihood below to gain more insight.

### ETD

In the ETD model, we assume that the examined trait affects
diversification. In a secsse analysis we need to specify the structure
of three distinct properties: the lambda list, the mu vector and the
transition (Q) matrix. Each of these informs properties of the model of
speciation, extinction and trait-shifts respectively.

#### Lambda matrices

Speciation in a secsse model is defined using a list of matrices, where
each matrix highlights the state of the daughter species resulting from
a speciation event. In our case, we have a trait with two states, and
thus we will have to specify a list with two matrices, one for each
state, where each matrix in turn will then specify the daughter states.
We can do so by hand, but secsse includes functionality to do this in a
more organized manner - this is especially useful if you have a trait
with more than two states for instance. In this more organized manner,
we can provide secsse with a matrix specifying the potential speciation
results, and secsse will construct the lambda list accordingly:

```{r ETD_lambda}
spec_matrix <- c()
spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1))
spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2))
lambda_list <- secsse::create_lambda_list(state_names = c(0, 1),
                                          num_concealed_states = 2,
                                          transition_matrix = spec_matrix,
                                          model = "ETD")
lambda_list
```

Let's see what the code has done. First, we create a `spec_matrix`, where
the first column indicates the parent species (0 or 1) and the second
and third column indicate the identities of the two daughter species. In
this case, we choose for symmetric speciation without a change of trait,
e.g. the daughters have the same trait as the parent. If you have
evidence of perhaps asymmetric inheritance, you can specify this here.
The fourth column indicates the associated rate indicator. In this case
we choose two different speciation rates. We choose two concealed
states, as it is good practice to have the same number of concealed
states as observed states. The resulting `lambda_list` then contains four
entries, one for each unique state (see the names of the entries in the list), 
that is, for each combination of observed and concealed states, where the
concealed states are indicated with a capital letter.
Looking at the first entry in the list, e.g. the
result of a speciation event starting with a parent in state 0A, will
result with rate 1 in two daughter species of state 0A as well. The way
to read this, is by looking at the row and column identifiers of the
entered rate. Similarly, for a speciation event starting in state 1A
(`lambda_list[[2]]`), the two daughter species are 1A as well, but this
time with rate 2, as we specified that species with trait 1 will have a
different speciation rate. Note that here, rates 1 and 2 are ordered
with the observed trait, we will later explore the CTD model, where the
rates will be sorted according to the concealed state.

#### Mu vector

Having the speciation rates set, we can move on to extinction rates.
Since we are using the ETD model, here we also expect the extinction
rates to be different:

```{r ETD_mu}
mu_vec <- secsse::create_mu_vector(state_names = c(0, 1),
                                   num_concealed_states = 2,
                                   model = "ETD",
                                   lambda_list = lambda_list)
mu_vec
```

The function `create_mus_vector()` takes the same standard information we
provided earlier, with as addition our previously made `lambda_list`. It uses the
`lambda_list` to identify the rate indicators (in this case 1 and 2) that
are already used and to thus pick new rates. We see that secsse has
created a named vector with two extinction rates (3 and 4), which are
associated with our observed traits 0 and 1.

#### Transition matrix

Lastly, we need to specify our transition matrix. Often, Q matrices can get
quite large and complicated, the more states you are analyzing. We have devised
a tool to more easily put together Q matrices. This tool starts from the
so-called `shift_matrix`, the basic matrix in which we only find information on
transitions between examined states. The information contained in this
`shift_matrix` is then automatically mimicked for inclusion in the full matrix,
to ensure that the same complexity in examined state transitions is also found
in concealed states. 
Instead of specifying the entire `shift_matrix`, instead it suffices to
only specify the non-zero transitions. In this case these are from state 0 
to 1, and vice versa:

```{r ETD_Q}
shift_matrix <- c()
shift_matrix <- rbind(shift_matrix, c(0, 1, 5))
shift_matrix <- rbind(shift_matrix, c(1, 0, 6))

q_matrix <- secsse::create_q_matrix(state_names = c(0, 1),
                                    num_concealed_states = 2,
                                    shift_matrix = shift_matrix,
                                    diff.conceal = TRUE)
q_matrix
```

Thus, we first specify a matrix containing the potential state
transitions, here 0-\>1 and 1-\>0. Then, we use
`create_q_matrix()` to create the q-matrix. By setting
`diff.conceal` to `TRUE`, we ensure that the concealed states will get
their own rates specified. Setting this to `FALSE` would set their rates
equal to the observed rates (5 and 6). The way to read the transition
matrix is column-row, e.g. starting at state 0A, with rate 5 the species
will shift to state 1A and with rate 7 it will shift to state 0B. We
intentionally ignore 'double' shifts, e.g. from 0A to 1B, where both the
observed and the concealed trait shift at the same time. If you have
good evidence to include such shifts in your model, you can modify the
trans_matrix by hand of course.

#### Maximum Likelihood

We have now specified the required ingredients to perform Maximum
Likelihood analyses. Prerequisites for performing Maximum Likelihood analyses with secsse
are that we specify the ids of the rates we want optimized, and provide
initial values. We can do so as follows:

```{r ETD_ML_init}
idparsopt <- 1:8 # our maximum rate parameter was 8
idparsfix <- c(0) # we want to keep all zeros at zero
initparsopt <- rep(0.1, 8)
initparsfix <- c(0.0) # all zeros remain at zero.
sampling_fraction <- c(1, 1)
```

Here, we specify that we want to optimize all parameters with rates 1,
2, ..., 8. We set these at initial values at 0.1 for all parameters. Here, we
will only use one starting point, but in practice it is often advisable
to explore multiple different initial values to avoid getting stuck in a
local optimum and missing the global optimum. `idparsfix` and `initparsfix`
indicate that all entries with a zero are to be kept at the value zero.
Lastly, we set the sampling fraction to be c(1, 1), this indicates to
secsse that we have sampled per trait all species with that trait in our
dataset. Alternatively, if we know that perhaps some species with trait
0 are missing, we could specify that as c(0.8, 1.0). Thus, note that the
sampling fraction does not add up to 1 across traits, but within traits.

And now we can perform maximum likelihood:

```{r ETD_ML}

idparslist <- list()
idparslist[[1]] <- lambda_list
idparslist[[2]] <- mu_vec
idparslist[[3]] <- q_matrix

answ <- secsse::cla_secsse_ml(phy = phylo_vignette,
                              traits = traits$trait,
                              num_concealed_states = 2,
                              idparslist = idparslist,
                              idparsopt = idparsopt,
                              initparsopt = initparsopt,
                              idparsfix = idparsfix,
                              parsfix = initparsfix,
                              sampling_fraction = sampling_fraction,
                              verbose = FALSE)
```

We can now extract several pieces of information from the returned
answer:

```{R ETD_res}
ML_ETD <- answ$ML
ETD_par <- secsse::extract_par_vals(idparslist, answ$MLpars)
ML_ETD
ETD_par
spec_rates <- ETD_par[1:2]
ext_rates <- ETD_par[3:4]
Q_Examined <- ETD_par[5:6]
Q_Concealed <- ETD_par[7:8]
spec_rates
ext_rates
Q_Examined
Q_Concealed
```

The function `extract_par_vals()` goes over the list `answ$MLpars` and
places the found parameter values back in consecutive vector 1:8 in this
case. Here, we find that the speciation rate of trait 1 is higher than
the speciation rate of trait 0.

### CTD

Let's compare our findings with a CTD model, e.g. a model centered
around the concealed trait. Again, we need to specify our lambda list,
mu vector and transition matrix. We will see that this is quite
straightforward now that we have gotten the hang of how this works.

#### Lambda matrices

Again, we specify two distinct rates, indicating that the observed state
inherits faithfully to the daughter species. However, this time, we set
the model indicator to "CTD":

```{r CTD_lambda}
spec_matrix <- c()
spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1))
spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2))
lambda_list <- secsse::create_lambda_list(state_names = c(0, 1),
                                          num_concealed_states = 2,
                                          transition_matrix = spec_matrix,
                                          model = "CTD")
lambda_list
```

The resulting `lambda_list` now has the chosen rates 1 and 2 sorted
differently across the matrices, with matrices 1 and 2 containing rate
1, and matrices 3 and 4 containing rate 2. Looking at the column names
of the matrices, states 1 and 2 are states 0A and 1A, and states 3 and 4
are states 0B and 1B, in other words, speciation rate 1 is now
associated with all states with concealed state A, and speciation rate 2
is now associated with all states with concealed state B.

#### Mu vector

For the mu vector, we repeat the same we did for the ETD model:

```{r CTD_mu}
mu_vec <- secsse::create_mu_vector(state_names = c(0, 1),
                                   num_concealed_states = 2,
                                   model = "CTD",
                                   lambda_list = lambda_list)
mu_vec
```

Here, again, we see that whereas previously extinction rate 3 was
associated with states 0A and 0B (e.g. all states with state 0), it is
now associated with states 0A and 1A, e.g. all states associated with
concealed state A.

#### Transition matrix

Setting up the transition matrix is not different from the ETD model,
the same transitions are possible:

```{r CTD_Q}
shift_matrix <- c()
shift_matrix <- rbind(shift_matrix, c(0, 1, 5))
shift_matrix <- rbind(shift_matrix, c(1, 0, 6))

q_matrix <- secsse::create_q_matrix(state_names = c(0, 1),
                                    num_concealed_states = 2,
                                    shift_matrix = shift_matrix,
                                    diff.conceal = TRUE)
q_matrix
```

#### Maximum Likelihood

Now that we have specified our matrices, we can use the same code we
used for the ETD model to perform our maximum likelihood:

```{r CTD_ML}
idparsopt <- 1:8 # our maximum rate parameter was 8
idparsfix <- c(0) # we want to keep all zeros at zero
initparsopt <- rep(0.1, 8)
initparsfix <- c(0.0) # all zeros remain at zero.
sampling_fraction <- c(1, 1)

idparslist <- list()
idparslist[[1]] <- lambda_list
idparslist[[2]] <- mu_vec
idparslist[[3]] <- q_matrix

answ <- secsse::cla_secsse_ml(phy = phylo_vignette,
                              traits = traits$trait,
                              num_concealed_states = 2,
                              idparslist = idparslist,
                              idparsopt = idparsopt,
                              initparsopt = initparsopt,
                              idparsfix = idparsfix,
                              parsfix = initparsfix,
                              sampling_fraction = sampling_fraction,
                              verbose = FALSE)
ML_CTD <- answ$ML
CTD_par <- secsse::extract_par_vals(idparslist, answ$MLpars)
ML_CTD
CTD_par
spec_rates <- CTD_par[1:2]
ext_rates <- CTD_par[3:4]
Q_Examined <- CTD_par[5:6]
Q_Concealed <- CTD_par[7:8]
spec_rates
ext_rates
Q_Examined
Q_Concealed
```

Here we now find that state A has a very low speciation rate, in
contrast to a much higher speciation rate for state B (remember that
speciation rate 1 is now associated with A, and not with state 0!).
Similarly, extinction rates for both states are also quite different,
with state A having a much lower extinction rate than state B. Examined
trait shifts (`Q_Examined`) are quite low, whereas concealed trait shifts
seem to be quite high. The LogLikelihood seems to be lower than what we
found for the ETD model.

### CR

As a check, we will also fit a model where there is no trait effect -
perhaps we are looking for an effect when there is none. This is always
a good sanity check.

#### Lambda matrices

To specify the lambda matrices, this time we choose the same rate
indicator across both states.

```{r CR_lambda}
spec_matrix <- c()
spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1))
spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 1))
lambda_list <- secsse::create_lambda_list(state_names = c(0, 1),
                                          num_concealed_states = 2,
                                          transition_matrix = spec_matrix,
                                          model = "CR")
lambda_list
```

#### Mu vector

The mu vector follows closely from this, having a shared extinction rate
across all states:

```{r CR_mu}
mu_vec <- secsse::create_mu_vector(state_names = c(0, 1),
                                   num_concealed_states = 2,
                                   model = "CR",
                                   lambda_list = lambda_list)
mu_vec
```

#### Transition matrix

We will use the same transition matrix as used before, although one
could perhaps argue that without a trait effect, all rates in the
transition matrix (both forward and reverse trait shifts) should share
the same rate. Here, we will choose the more parameter-rich version
(Home assignment: try to modify the code to perform an analysis in which
all rates in the transition matrix are the same).

```{r CR_Q}
shift_matrix <- c()
shift_matrix <- rbind(shift_matrix, c(0, 1, 3))
shift_matrix <- rbind(shift_matrix, c(1, 0, 4))

q_matrix <- secsse::create_q_matrix(state_names = c(0, 1),
                                    num_concealed_states = 2,
                                    shift_matrix = shift_matrix,
                                    diff.conceal = TRUE)
q_matrix
```

#### Maximum Likelihood

```{r CR_ML}
idparsopt <- 1:6 # our maximum rate parameter was 6
idparsfix <- c(0) # we want to keep all zeros at zero
initparsopt <- rep(0.1, 6)
initparsfix <- c(0.0) # all zeros remain at zero.
sampling_fraction <- c(1, 1)

idparslist <- list()
idparslist[[1]] <- lambda_list
idparslist[[2]] <- mu_vec
idparslist[[3]] <- q_matrix

answ <- secsse::cla_secsse_ml(phy = phylo_vignette,
                              traits = traits$trait,
                              num_concealed_states = 2,
                              idparslist = idparslist,
                              idparsopt = idparsopt,
                              initparsopt = initparsopt,
                              idparsfix = idparsfix,
                              parsfix = initparsfix,
                              sampling_fraction = sampling_fraction,
                              verbose = FALSE)
ML_CR <- answ$ML
CR_par <- secsse::extract_par_vals(idparslist, answ$MLpars)
ML_CR
CR_par
spec_rate <- CR_par[1]
ext_rate <-  CR_par[2]
Q_Examined <- CR_par[3:4]
Q_Concealed <- CR_par[5:6]
spec_rate
ext_rate
Q_Examined
Q_Concealed
```

We now recover a non-zero extinction rate, and much higher transition
rates for the concealed than for the observed states.

### Model comparisong using AIC

Having collected the different log likelihoods, we can directly compare
the models using AIC. Remembering that the AIC is 2k - 2LL, where k is
the number of parameters of each model and LL is the Log Likelihood, we
can calculate this as follows:

```{r AIC}
res <- data.frame(ll = c(ML_ETD, ML_CTD, ML_CR),
                  k  = c(8, 8, 6),
                  model = c("ETD", "CTD", "CR"))
res$AIC <- 2 * res$k - 2 * res$ll
res
```

I can now reveal to you that the tree we used was generated using an ETD
model, which we have correctly recovered!

## Further help

If after reading these vignettes, you still have questions, please feel free to
create an issue at the package's GitHub repository
https://github.com/rsetienne/secsse/issues or e-mail the authors for help with 
this R package. Additionally, bug reports and feature requests are welcome by
the same means.

## References 

Beaulieu, J. M., O'Meara, B. C., & Donoghue, M. J. (2013). Identifying hidden
rate changes in the evolution of a binary morphological character: the evolution
of plant habit in campanulid angiosperms. Systematic biology, 62(5), 725-737.

Beaulieu, J. M., & O'Meara, B. C. (2016). Detecting hidden diversification
shifts in models of trait-dependent speciation and extinction. Systematic
biology, 65(4), 583-601.

FitzJohn, R. G. (2012). Diversitree: comparative phylogenetic analyses of
diversification in R. Methods in Ecology and Evolution, 3(6), 1084-1092.

Harmon, L. J., Weir, J. T., Brock, C. D., Glor, R. E., & Challenger, W. (2008).
GEIGER: investigating evolutionary radiations. Bioinformatics, 24(1), 129-131.

Rabosky, D. L., & Goldberg, E. E. (2015). Model inadequacy and mistaken
inferences of trait-dependent speciation. Systematic Biology, 64(2), 340-355.