---
title: "The 'aphid' package for analysis with profile hidden Markov models"
author: "Shaun Wilkinson"
date: "2017-06-23"
output: rmarkdown::html_vignette
bibliography: aphid.bib
vignette: >
  %\VignetteIndexEntry{Introduction to the aphid package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE, warning = FALSE}
#knitr::opts_chunk$set(out.width='750px', dpi=200)
#knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

--------------------------------------------------------------------------------
## Abstract
**aphid** is an R package for the development and application of
hidden Markov models and profile HMMs for biological sequence analysis. 
It contains functions for multiple and pairwise sequence alignment, 
model construction and parameter optimization, file import/export,
implementation of the forward, backward and Viterbi algorithms for conditional sequence probabilities, tree-based sequence weighting, and sequence simulation. 
The package has a wide variety of uses including database searching, 
gene-finding and annotation, phylogenetic analysis and sequence classification.  

## Introduction
Hidden Markov models (HMMs) underlie many of the most important tasks in 
computational biology, including multiple sequence alignment, 
genome annotation, and increasingly, sequence database searching. 
Originally developed for speech recognition algorithms, 
their application to the field of molecular biology has increased 
dramatically since advances in computational capacity have 
enabled full probabilistic analysis in place of heuristic approximations. 
Pioneering this transition are two groups lead by 
Anders Krogh and Sean Eddy, whose respective software packages 
[SAM](https://engineering.ucsc.edu/departments/biomolecular-engineering/) 
and [HMMER](http://www.hmmer.org) 
have underpinned HMM-based bioinformatic analysis for over two 
decades. 

Here, we present the **aphid** package for analysis with 
profile hidden Markov models in the R environment [@RCoreTeam2015]. 
The package contains functions for developing, plotting, importing and 
exporting both standard and profile HMMs, as well as implementations of 
the forward, backward and Viterbi algorithms for computing full and optimal
conditional sequence probabilities. The package also features
a multiple sequence alignment tool that produces high quality alignments
*via* profile HMM training.

## The 'aphid' package
###Dependencies
The **aphid** package is designed to work in conjunction with the "DNAbin" 
and "AAbin" object types produced using the **ape** package 
[@Paradis2004; @Paradis2012].
These object types, in which sequences are represented in a bit-level coding 
scheme, are preferred over standard character-type sequences for maximizing 
memory and speed efficiency.
While we recommend using **ape** alongside **aphid**, it is not a 
requisite and as such is listed in the "Suggests" rather than "Imports" 
section of the package description. 
Indeed, any sequence of standard ASCII characters is supported, 
making **aphid** suitable for other applications outside of biological 
sequence analysis. 
However, it should be noted that if DNA and/or amino acid sequences are 
input as character vectors, the functions may not recognize 
the ambiguity codes and therefore are not guaranteed to treat them 
appropriately.

To maximize speed, the low-level dynamic programming functions
(including the forward, backward, Viterbi, and maximum 
*a posteriori* algorithms) are written in C++ linking to the Rcpp package 
[@Eddelbuettel2011]. 
R versions of these functions are also maintained for the purposes 
of debugging, experimentation and code interpretation.
This package also relies on the **openssl** package [@Ooms2016] for 
sequence and alignment comparisons using the MD5 hash algorithm.

###Classes
Two primary object classes, "HMM" (hidden Markov model) and 
"PHMM" (profile hidden Markov model) are generated using the **aphid** 
functions deriveHMM and derivePHMM, respectively. 
These objects are lists consisting of emission and transition 
probability matrices (elements named "E" and "A"), and 
non-mandatory elements that may include vectors of background 
emission and transition probabilities ("qe" and "qa", respectively) 
and other model metadata including "name", "description", 
"size" (the number of modules in the model), and 
"alphabet" (the set of symbols/residues emitted by the model). 
Objects of class "DPA" (dynamic programming array) are also generated by 
the Viterbi and forward/backward functions. These are 
predominantly created for the purposes of succinct console-printing. 

###Functions
HMMs and PHMMs are explained in more detail throughout the following sections 
using **aphid** package functions to demonstrate their utility. 
The examples are borrowed from Durbin et al [-@Durbin1998], 
to which users are encouraged to refer for a more in-depth explanation on the 
theory and application of these models. Book chapter numbers are provided 
wherever possible for ease of reference. 

#### Hidden Markov Models
A hidden Markov model is a probabilistic data-generating mechanism for a 
sequence or set of sequences. 
It is depicted by a network of *states* each emitting symbols from a finite 
*alphabet* according to a set of *emission probabilities*, whose values are 
specific to each state. 
The states are traversed by an interconnecting set of 
*transition probabilities*, that include the probability
of remaining in any given state and those of transitioning 
to each of the other connected states.

An example of a simple HMM is given in Durbin et al 
[-@Durbin1998] chapter 3.2. 
An imaginary casino has two dice, one fair and one weighted.
The fair dice emits residues from the alphabet {1, 2, 3, 4, 5, 6}
with equal probabilities (1/6 for each residue).
The probability of rolling a "6" with the loaded dice is 0.5, while
that of each of the other five residues is 0.1.
If the dealer has the fair dice, he may secretly switch to the loaded 
dice with a probability of 0.05 after each roll, leaving a 
95% chance that he will retain the fair dice. Alternatively, if he has the 
loaded dice, he will switch back to the fair dice with a probability of 
0.1, or more likely, retain the loaded dice with a probability of 0.9.

This example can be represented by a simple two-state hidden Markov
model. The following code manually builds and plots the "HMM" object. 

```{r, fig.width=7.15, fig.height=4}
library("aphid")
states <- c("Begin", "Fair", "Loaded")
residues <- paste(1:6)
### Define transition probability matrix A
A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3)
dimnames(A) <- list(from = states, to = states)
### Define emission probability matrix E
E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE)
dimnames(E) <- list(states = states[-1], residues = residues)
### Create the HMM object
x <- structure(list(A = A, E = E), class = "HMM")
### Plot the model
plot(x, textexp = 1.5)
### Optionally add the transition probabilities as text
text(x = 0.02, y = 0.5, labels = "0.95")
text(x = 0.51, y = 0.5, labels = "0.90")
text(x = 0.5, y = 0.9, labels = "0.05")
text(x = 0.5, y = 0.1, labels = "0.10")
```
**Figure 1: A simple hidden Markov model for the dishonest casino example.**
The plot.HMM method depicts the transition probabilities as weighted lines, 
and emission probabilities as horizontal grey bars. 
No begin/end state is modeled in this example; however, this can be achieved 
by entering non-zero probabilities in the first row and column of the 
transition matrix and passing "begin = TRUE" to `plot.HMM`.

For a sequence of observed rolls, we can establish the most likely sequence 
of hidden states (including when the dice-switching most likely occurred) using 
the Viterbi algorithm. 
In the example given in Durbin et al [-@Durbin1998] chapter 3.2, the observed 
sequence of 300 rolls is:
```{r, echo = FALSE}
data(casino)
cat("", 
    paste0(casino[1:50], collapse = ""), "\n", 
    paste0(casino[51:100], collapse = ""), "\n", 
    paste0(casino[101:150], collapse = ""), "\n", 
    paste0(casino[151:200], collapse = ""), "\n", 
    paste0(casino[201:250], collapse = ""), "\n", 
    paste0(casino[251:300], collapse = ""), "\n")
```

Some observable clusters of 6's suggest that the loaded dice made an appearance 
at some stage, but when did the dice-switching occur?
In the following code, the Viterbi algorithm is used to find the most likely 
sequence of hidden states given the model.
```{r}
data(casino)
### The actual path is stored in the names attribute of the sequence
actual <- c("F", "L")[match(names(casino), c("Fair", "Loaded"))]
### Find the predicted path
vit1 <- Viterbi(x, casino)
predicted <- c("F", "L")[vit1$path + 1]
### Note the path element of the output Viterbi object is an integer vector
### the addition of 1 to the path converts from C/C++ to R's indexing style
```

Comparing the predicted path with the actual hidden sequence, the Viterbi 
algorithm wasn't far off:
```{r, echo = FALSE}

cat("", 
    "Actual    ", paste0(actual[1:50], collapse = ""), "\n", 
    "Predicted ", paste0(predicted[1:50], collapse = ""), "\n\n", 
    "Actual    ", paste0(actual[51:100], collapse = ""), "\n",
    "Predicted ", paste0(predicted[51:100], collapse = ""), "\n\n", 
    "Actual    ", paste0(actual[101:150], collapse = ""), "\n", 
    "Predicted ", paste0(predicted[101:150], collapse = ""), "\n\n", 
    "Actual    ", paste0(actual[151:200], collapse = ""), "\n", 
    "Predicted ", paste0(predicted[151:200], collapse = ""), "\n\n", 
    "Actual    ", paste0(actual[201:250], collapse = ""), "\n", 
    "Predicted ", paste0(predicted[201:250], collapse = ""), "\n\n", 
    "Actual    ", paste0(actual[251:300], collapse = ""), "\n",
    "Predicted ", paste0(predicted[251:300], collapse = ""), "\n")
```


We can also calculate the full and posterior probabilities of the sequence 
given the model using the `forward` and/or `backward` algorithms:

```{r, fig.width=7.15, fig.height=4}
casino.post <- posterior(x, casino)
plot(1:300, seq(0, 1, length.out = 300), type = "n", xlab = "Roll number",
     ylab = "Posterior probability of dice being fair")
starts <- which(c("L", actual) == "F" & c(actual, "F") == "L")
ends <- which(c("F", actual) == "L" & c(actual, "L") == "F") - 1
for(i in 1:6) rect(starts[i], 0, ends[i], 1, col = "grey", border = NA)
lines(1:300, casino.post[1, ])
```
**Figure 2: Posterior state probabilities for the 300 dice rolls.** 
The line shows the posterior probability that the dice was fair at each roll,
while the grey rectangles show the actual periods for which the loaded dice was 
being used. See Durbin et al [-@Durbin1998] chapter 3.2 for more details.


##### Deriving HMMs from sequence data
The **aphid** package features the function `deriveHMM` for building an HMM from 
a set of training sequences.
The following code derives a simple HMM from our single sequence of dice rolls 
with its known state path (stored as the 'names' attribute of the sequence).   

```{r, fig.width=7.15, fig.height=4}
y <- deriveHMM(list(casino), logspace = FALSE)
plot(y, textexp = 1.5)

### Optionally add the transition probabilities as text
text(x = 0.02, y = 0.5, labels = round(y$A["Fair", "Fair"], 2))
text(x = 0.51, y = 0.5, labels = round(y$A["Loaded", "Loaded"], 2))
text(x = 0.5, y = 0.9, labels = round(y$A["Fair", "Loaded"], 2))
text(x = 0.5, y = 0.1, labels = round(y$A["Loaded", "Fair"], 2))
```

**Figure 3: A simple HMM derived from the sequence of 300 dice rolls.**
As in Fig. 1, transition probabilities are shown as weighted lines and 
emission probabilities as horizontal grey bars. 

This appears to be fairly close to the actual model, despite the fact that the 
training data consisted of just a single sequence.
One would typically derive an HMM from a list of many such sequences (hence why 
the input argument is a list and not a vector) but this example is simplified 
for clarity. 


#### Profile Hidden Markov Models 
A profile hidden Markov model is an extension of a standard HMM, 
where the emission and transition probabilities are *position specific*. 
That is, they can change at each point along the sequence. 
These models typically have many more parameters than their simpler 
HMM counterparts, but can be very powerful for sequence analysis. 
The precursor to a profile HMM is normally a multiple sequence alignment.
Each column in the alignment will often (but not always) be represented by 
one internal position or "module" in the model, with each module consisting of 
three states: 

* a silent *delete* state that does not emit residues. 
* an *insert* state with emission probabilities reflecting the background
residue frequencies averaged over the entire alignment.
* a *match* state with emission probabilities reflecting the residue frequencies 
in the alignment column. 


Figure 4 shows the three state types listed above as circles diamonds and rectangles, 
respectively.
The states are linked by transition probabilities shown as weighted lines in the graph. 

Consider this small partial alignment of amino acid sequences from Durbin
et al [-@Durbin1998] chapter 5.3:

```{r}
data(globins)
globins
```

Position-specific patterns include a high probability of observing a "V"
at position 1 and an "A" or "G" at position 3. 
When tabulating the frequencies it is also prudent to add pseudo-counts, 
since the absence of a particular transition or emission type does not 
preclude the possibility of it occurring in another (unobserved) sequence. 
Pseudo-counts can be Laplacean (adds one of each emission and transition type), 
background (adjusts the Laplacean pseudo-counts to reflect the background frequencies 
derived from the entire alignment), or user defined, which can include 
more complex pseudo-count schemes such as Dirichlet mixtures [@Durbin1998].  
The default option for the derivePHMM function is "background".

The following code derives a profile HMM from the globin data and plots the model:

```{r, fig.width=7.15, fig.height=5}
globins.PHMM <- derivePHMM(globins, residues = "AMINO", pseudocounts = "Laplace")
plot(globins.PHMM)
```

**Figure 4: Profile HMM derived from a partial globin sequence alignment.**
Match states are shown as rectangles, insert states as diamonds, and delete 
states as circles. Grey horizontal bars represent the emission 
probabilities for each residue in the alphabet (in this case the amino acid 
alphabet) at each position in the model. Numbers in the delete states are 
simply model module numbers, while those in the insert states are the 
probabilities of remaining in the current insert state at the next emission 
cycle. Lines are weighted and directed where necessary to reflect the 
transition probabilities between states.
The large "B" and "E" labels represent are the silent begin and end states, 
respectively.

Note that there are only 8 internal modules (excluding the begin and end states), 
while the alignment had 10 columns. 
The `derivePHMM` function decided (using the maximum *a posteriori* algorithm)
that there was not enough residue information in columns 4 and 5 of the alignment 
to warrant assigning them internal modules in the model.
Instead, the last sequence in the alignment (GLB1_GLYDI) was considered to have 
entered the insert state at position 3 where it remained for two emission cycles
(emitting an "A" and a "D") before transitioning to the match state in module 4.
We can show this by calculating the optimal path of that sequence through the 
model, again using the Viterbi algorithm:

```{r}
path <- Viterbi(globins.PHMM, globins["GLB1_GLYDI", ])$path
path
```
The "path" element of the Viterbi object is an integer vector with elements taking 
values 0 ("delete"), 1 ("match") or 2 ("insert"). 
The path can be expressed more intuitively as characters instead of indices as follows:

```{r}
c("D", "M", "I")[path + 1]
```

Note that the addition of 1 to each path element is simply to convert 
from the C/C++ indexing style (which begins at 0) to R's style. 

Sequences do not need to be aligned to produce a profile HMM.
The function `derivePHMM` can optionally take a list of unaligned
sequences, in which case the longest sequence is used as a 'seed'
to create a preliminary profile HMM, and the model is iteratively
trained with the sequence list using either the Baum Welch or Viterbi 
training algorithm (see model training section below). 

##### File I/O
Profile HMMs can be exported as text files in the HMMER v3 format
(<http://www.hmmer.org/>) using the function `writePHMM`. 
For example, the small globin profile HMM can be exported by running 
`writePHMM(globins.PHMM)`. 
Similarly, a HMMER v3 text file can be parsed into R as an object of 
class "PHMM" with the function `readPHMM`. 

##### Sequence Simulation
To simulate data with random variation, the **aphid** package features the 
function `generate` with methods for both HMMs and PHMM objects. 
Sequences are generated recursively using the transition and emission probabilities
from within the model. There are two compulsory arguments, a model 
(object class "HMM" or "PHMM") and the "size" argument, which specifies
the maximum length of the sequence (this prevent an overflow situation 
that can occur if insert-insert transition probabilities are relatively high). 
For example, the following code simulates a list of 10 random sequences from 
the small globin profile HMM:

```{r}
sim <- list(length = 10)
suppressWarnings(RNGversion("3.5.0"))
set.seed(9999)
for(i in 1:10) sim[[i]] <- generate(globins.PHMM, size = 20)
sim
```
Note that the names attributes specify which state each residue was emitted 
from, and gap symbols are emitted from delete states. 
If these gaps are not required they can be removed as follows:
```{r}
sim <- lapply(sim, function(s) s[names(s) != "D"])
```

##### Model Training
The **aphid** package offers the function `train` for optimizing model
parameters using either the Baum Welch or Viterbi training algorithm.
Both are iterative refinement algorithms; the former
does not rely on a multiple sequence alignment
but is generally much slower than the latter. 
The Viterbi training operation can be sped up further 
by specifying the "cores" argument for parallel processing. 
The best choice of training algorithm will generally depend on the 
nature of the problem and the computing resources available. 
For more information see Durbin et al [-@Durbin1998] chapter 3.3 for 
standard HMMs and chapter 6.5 for profile HMMs.

The following code trains the small globin profile HMM with the sequences 
simulated in the previous step using the Baum Welch algorithm.
```{r}
globins2.PHMM <- train(globins.PHMM, sim, method = "BaumWelch", 
                       deltaLL = 0.01, seqweights = NULL)
```

As shown in the feedback (which can be switched off by setting "quiet = TRUE"),
this operation took 7 expectation-maximization iterations to converge to the 
specified delta log-likelihood threshold of 0.01.

## Sequence Alignment
The **aphid** package can be used to produce high-quality multiple sequence 
alignments using the iterative model training method outlined above. 
The function `align` takes as its primary argument a list of sequences 
either as a "DNAbin" object, an "AAbin" object, or a list of character sequences. 
An object of class "PHMM" can be passed to the function as an optional secondary 
argument ("model"), in which case the sequences are simply aligned to the model 
to produce the alignment matrix. 
If "model" is NULL, a preliminary model is first 
derived using the 'seed' sequence method outlined above, after which the model 
is trained using either the Baum Welch or Viterbi training algorithm 
(specified *via* the "method" argument). 
The sequences are then aligned to the model in the usual fashion to produce the 
alignment.
Note that if only two sequences are present inthe input list, the `align` 
function will perform a pairwise alignment without a profile HMM 
(Smith-Waterman or Needleman-Wunch alignment).

In this final example, we will deconstruct the original globin alignment and 
re-align the sequences using the original PHMM as a guide.

```{r}
globins <- unalign(globins)
align(globins, model = globins.PHMM, seqweights = NULL, residues = "AMINO")
```
Note that the column names show the progressive positions along the model
and where residues were predicted to have been emitted by insert states 
(e.g. the 4th and 5th residues of sequence 7).   

## Further Reading
This package was written based on the algorithms described in 
Durbin et al [-@Durbin1998].
This book offers an in depth explanation of hidden Markov models 
and profile HMMs for users of all levels of familiarity. 
Many of the examples and datasets in the package are directly 
derived from the text, which serves as a useful primer for this 
package.
There are also excellent resources available for those wishing to use 
profile HMMs outside of the R environment. 
The aphid package maintains compatibility with the HMMER software suite 
through the file input and output functions readPHMM and writePHMM. 
Those interested are further encouraged to check out the SAM software 
package, which also features a comprehensive suite of functions and 
tutorials.

## Acknowledgements
This software was developed at Victoria University of Wellington, NZ, 
with funding from a Rutherford Foundation Postdoctoral Research Fellowship 
award from the Royal Society of New Zealand.

## References
