---
title: "ensembleTax package vignette"
author: "D Catlett"
date: "5/18/2021"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ensembleTax package vignette}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# ensembleTax overview
ensembleTax is an R package that allows incorporation of information from 
multiple taxonomic assignment algorithms and/or reference databases to compute 
ensemble taxonomic assignments for ASVs/OTUs generated by common marker gene 
sequence analyses. The package was built to conveniently compliment the dada2 R
package and pipeline, but can be used with any combination of taxonomic 
assignment algorithms and reference databases if the data can be read into R and
converted to a dataframe.

Here you can find instructions for downloading and installing the ensembleTax 
package, and a brief demonstration of a sample ensembleTax "pipeline".

## The problem(s) addressed by ensembleTax
Taxonomic assignment of marker gene sequences is a critical step of marker gene
workflows as it imparts ecological significance and understanding to genetic 
data.

Many taxonomic assignment algorithms have been proposed to assign taxonomy to 
marker gene sequences (or OTUs/ASVs). Similarly, analysts are often forced to 
choose from one of several reference databases containing representative 
marker gene sequences with known taxonomic identities. The "best" assignment 
algorithm and/or reference database for a particular scientific question is 
often not obvious. To complicate things further, different reference databases 
generally do not share consistent taxonomic naming or ranking conventions. 

ensembleTax solves this problem by providing flexible algorithms that synthesize
information from multiple taxonomic assignment algorithm/reference database 
combinations and compute a single ensemble taxonomic assignment for each ASV/OTU
in a marker gene data set. 

## ensembleTax algorithms
The core algorithms employed by ensembleTax are *taxmapper* and 
*assign.ensembleTax*. *taxmapper* maps, or 'translates', one taxonomic 
nomenclature onto another by exact name matching. *taxmapper* is rank-agnostic, 
meaning it does not consider the hierarchical structure of a taxonomy and 
assumes that a taxonomic name means the same thing regardless of which reference
database employs it (there are times when this assumption is not valid, for 
example when "Clade_X" or similar is used as a stand-alone taxonomic name, but 
*taxmapper* can handle these cases without introducing errors in taxonomic 
assignments).

*assign.ensembleTax* computes ensemble taxonomic assignments based on 
assignments determined by any number of individual taxonomic assignment 
algorithm/reference database combinations. *assign.ensembleTax* includes several
user-tuneable parameters that balance obtaining assignments at lower taxonomic 
ranks for a larger number of ASVs (at the expense of a likely increase in 
false-positive annotations) with obtaining more robust assignments for fewer 
ASVs that are supported by multiple methods.

Additional functions are included for pre-processing taxonomic assignments 
generated by specific taxonomic assignment algorithms and reference databases. 
These functions are designed to conveniently plug in downstream of the dada2 
pipeline, but other pipelines may be used if the data is formatted properly for 
use with *taxmapper* and/or *assign.ensembleTax* (see the documentation for the 
objects these algorithms can work with).

The outputs of the following taxonomic assignment algorithms are explicitly 
supported by ensembleTax:

1. RDP bayesian classifier as implemented in dada2's assignTaxonomy.
2. idtaxa algorithm as implemented in DECIPHER. 

Supported reference databases include:

1. Silva SSU NR reference database v138 (silva).
2. Protistan Ribosomal Reference database v4.12.0 (pr2).
3. RDP train set v16
4. GreenGenes v13.8 clustered at 97% similarity

Note that other databases may still be used with ensembleTax, but they must be 
mapped onto the taxonomic nomenclatures employed by one of the above supported 
databases using *taxmapper*, or the user must extract all unique taxonomic 
assignments from the database and format them properly for use with *taxmapper*.
A vignette that shows how to extract taxonomic assignments from a database not 
supported by the ensembleTax package can be found here:
https://github.com/dcat4/ensembleTax/blob/master/add_tax_dbs.md

A vignette that shows how to incorporate taxonomic assignments produced outside
of R into an ensembleTax workflow can be found here:
https://github.com/dcat4/ensembleTax/blob/master/add_tax_tab_from_csv.md

### ensembleTax 'pipeline' demonstration
Here we step through a simple example of an ensembleTax workflow to compute 
ensemble taxonomic assignments for a small set of 18S-V9 protist ASVs.

Since ensembleTax is built to plug in directly downstream of the dada2 pipeline,
to start the ensembleTax pipeline demonstration we include here code that can be 
used to generate initial sets of taxonomic assignments using the RDP Bayesian 
classifier implemented in dada2's assignTaxonomy and DECIPHER's idtaxa 
algorithm. The below snippet comes directly from the dada2 tutorial. Please see 
the dada2 pipeline tutorial for further information:
https://benjjneb.github.io/dada2/tutorial.html

```{r, eval=FALSE}
# dada2 assignTaxonomy:
taxa <- assignTaxonomy(seqtab.nochim, "~/tax/silva_nr_v132_train_set.fa.gz", multithread=TRUE)
# this is optional and was not used in the example data below:
taxa <- addSpecies(taxa, "~/tax/silva_species_assignment_v132.fa.gz") 

# DECIPHER idtaxa:
library(DECIPHER); packageVersion("DECIPHER")
dna <- DNAStringSet(getSequences(seqtab.nochim)) # Create a DNAStringSet from the ASVs
load("~/tax/IDTaxa/SILVA_SSU_r132_March2018.RData") # CHANGE TO THE PATH OF YOUR TRAINING SET
ids <- IdTaxa(dna, trainingSet, strand="top", processors=NULL, verbose=FALSE) # use all processors
```

We encourage creating an ASV "rubric" to track ASVs through the ensembleTax 
pipeline. The rubric is a DNAStringSet (see the Biostrings package) object 
produced by extracting ASV sequences from the seqtab output by dada2, and giving 
them arbitrary names (like sv1, sv2, etc). You can create a "rubric" from an 
"ASV table" output by dada2 by doing the following:

```{r, eval=FALSE}
rubric <- DNAStringSet(getSequences(seqtab.nochim))
# this creates names (sv1, sv2, ..., svX) for each ASV
snam <- vector(mode = "character", length = length(rubric))
for (i in 1:length(rubric)) {
    snam[i] <- paste0("sv", as.character(i))
}
names(rubric) <- snam
```

For this vignette, instead of running the assignments we'll just load some data 
included with the ensembleTax package. These are outputs of dada2's 
assignTaxonomy implemented against pr2, and of DECIPHER's idtaxa implemented 
against both pr2 and silva. The rubric.sample is an example of a "rubric", which
ensembleTax uses to track ASV-identifying information.

```{r}
library("ensembleTax")
library("Biostrings")

data("idtax.pr2.sample")
data("idtax.silva.sample")
data("bayes.sample")
data("rubric.sample")
```

If you want to see what they look like, do this:

```{r, eval = FALSE}
head(idtax.pr2.sample)
head(idtax.silva.sample)
head(bayes.sample)
head(rubric.sample)
```

#### ensembleTax pre-processing
We see from the above that the data structures returned by our two taxonomic 
assignment algorithms are different. It is critically important that the order 
of sequences in the rubric and in the idtaxa-returned Taxon object are the same.
idtaxa does not return sequence names, but if you did not alter the order of 
your sequences as you provided them to idtaxa and/or DNAStringSet when creating 
your rubric, the ordering should be preserved and you should be good to go. 

Here we'll run these tables through ensembleTax's pre-processing functions. 
Supplying a rubric allows ensembleTax to give each taxonomy table the same 
ASV-identifying information and to better track and organize your data.

```{r}
idtax.pr2.pretty <- idtax2df(idtax.pr2.sample, 
                             db = "pr2", 
                             ranks = NULL,
                             boot = 50,
                             rubric = rubric.sample,
                             return.conf = FALSE)
idtax.silva.pretty <- idtax2df(idtax.silva.sample, 
                             db = "silva", 
                             ranks = NULL,
                             boot = 50,
                             rubric = rubric.sample,
                             return.conf = FALSE)
bayes.pr2.pretty <- bayestax2df(bayes.sample, 
                             db = "pr2", 
                             ranks = NULL,
                             boot = 50,
                             rubric = rubric.sample,
                             return.conf = FALSE)
```

Again to view the data, do this:

```{r}
# remove ASV columns from tax tables for easier viewing:
idtax.pr2.pretty <- idtax.pr2.pretty[ , -which(names(idtax.pr2.pretty) %in% c("ASV"))]
idtax.silva.pretty <- idtax.silva.pretty[ , -which(names(idtax.silva.pretty) %in% c("ASV"))]
bayes.pr2.pretty <- bayes.pr2.pretty[ , -which(names(bayes.pr2.pretty) %in% c("ASV"))]

head(idtax.pr2.pretty)
head(idtax.silva.pretty)
head(bayes.pr2.pretty)
```

We see that each taxonomy table is now a dataframe sorted by the column "svN". 

#### The taxmapper algorithm
After pre-processing our taxonomic assignment data sets above, we see we still
can't make apples-to-apples comparisons between the "idtax-silva" table and the 
other two because they employ different ranking and (though this may not be as 
obvious) naming conventions. *taxmapper* was created to solve this problem. 

Here we'll use *taxmapper* to 'translate' the idtax-silva taxonomic assignments 
onto the same taxonomic nomenclature as the other two tables.

For more detailed examples illustrating the behavior of *taxmapper* with 
different input arguments, a vignette devoted exclusively to *taxmapper* is 
included with the package documentation, and on Github.

Note that if you want to use your own custom synonym.file, there's a vignette 
showing how to do this here: 
https://github.com/dcat4/ensembleTax/blob/master/how_to_add_synonyms.md

```{r}

idtax.silva.mapped2pr2 <- taxmapper(idtax.silva.pretty,
                      tt.ranks = colnames(idtax.silva.pretty)[2:ncol(idtax.silva.pretty)],
                      tax2map2 = "pr2",
                      exceptions = c("Archaea", "Bacteria"),
                      ignore.format = TRUE,
                      synonym.file = "default",
                      streamline = TRUE,
                      outfilez = NULL)
head(idtax.silva.mapped2pr2)

```

Inspection of the mapped taxonomy table shows that it now mirrors the naming and
ranking conventions of the other two taxonomy tables.

#### The assign.ensembleTax algorithm
Now we have three different taxonomy tables with independent taxonomic 
assignments for each ASV in our example data set. From these we can compute 
ensemble taxonomic assignments with the *assign.ensembleTax* algorithm.

Here's a run with the default parameters:

```{r}

xx <- list(idtax.pr2.pretty, idtax.silva.mapped2pr2, bayes.pr2.pretty)
names(xx) <- c("idtax-pr2", "idtax-silva", "bayes-pr2")
eTax1 <- assign.ensembleTax(xx, 
                     tablenames = names(xx), 
                     ranknames = c("kingdom", "supergroup", "division","class","order","family","genus","species"),
                     tiebreakz = NULL, 
                     count.na=TRUE, 
                     assign.threshold = 0, 
                     weights=rep(1,length(xx)))
# show the 3 individuals again for easy viewing:
lapply(xx, FUN = head)
head(eTax1)

```

We see that the assignments made at the highest frequency across the 3 
assignment algorithms are assigned as the ensemble taxonomic assignment. 

For more detailed examples illustrating the behavior of *assign.ensembleTax* 
with different input arguments, please consult the other vignettes included in 
the ensembleTax package.

#### Comparisons of ensemble assignments with individual methods

As you can see above and in the *assign.ensembleTax* vignette, ensemble 
taxonomic assignments can be computed in different ways depending on your 
scientific objectives. We encourage you to try a few different settings that you 
think are appropriate for your science, and compare your ensemble assignments 
with those predicted by individual methods. 

We demonstrate some possibilities for doing so below. We'll use the reshape2 and
ggplot2 packages to help with this, so install those if you don't have them. 

First we compare the proportion of ASVs that are unassigned (assigned NA) at 
each rank:

```{r, eval=FALSE}
install.packages(c("ggplot2", "reshape2"))
```
```{r}
library("ggplot2")
library("reshape2")

nasum <- function(taxdf){
    notuz <- nrow(taxdf)
    x <- is.na(taxdf)
    ii <- colSums(x) / notuz
    return(ii)
}
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73")
# create a list with individuals and ensemble(s):
xx <- list(idtax.pr2.pretty, idtax.silva.mapped2pr2, bayes.pr2.pretty, eTax1)
names(xx) <- c("idtax-pr2", "idtax-silva", "bayes-pr2", "ensemble")
# remove meta-data columns (NOTE the hard-coded column removal!!!)
xx <- lapply(xx, function(x) x[, -c(1)])
tina <- lapply(xx, nasum)
yaboi <- matrix(unlist(tina), nrow=length(xx), byrow=TRUE)
rownames(yaboi) <- names(xx)
colnames(yaboi) <- colnames(xx[[1]])
yaboi <- as.data.frame(t(yaboi))
yaboi$rankz <- rownames(yaboi)
yaboi <- melt(yaboi, id.vars = "rankz")
plt <- ggplot(yaboi, aes(fill = variable, x = rankz, y = value)) + 
    geom_bar(stat="identity", color = "black", position=position_dodge(width=0.8)) + 
    labs(x = "Taxonomic Rank", y = "Proportion of ASVs Unassigned") + 
    scale_x_discrete(limits = colnames(xx[[1]])) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12), axis.title.x = element_text(size = 12, face="bold"),
          axis.text.y = element_text(size = 12), axis.title.y = element_text(size = 12, face="bold"),
          panel.background = element_rect(fill = "white",
                                          colour = "white",
                                          linetype = "solid"),
          panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                          colour = "white"),
          panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                          colour = "white"),
          axis.line = element_line(size = 0.5, linetype = "solid", colour = "black"),
          panel.border = element_rect(colour = "black", fill=NA, size=1)) + 
    scale_fill_manual(name = "Taxonomy table",
                      breaks = names(xx),
                      values = cbPalette)
print(plt)
```

For our simple 5-ASV example in this vignette, you could more or less see these 
patterns just by manually inspecting the data. For thousands of ASVs though, 
this type of analysis can be useful. 

A couple things to note here: in most cases, if you've mapped assignments from 
one database onto another, some information will likely be lost because 
reference databases occasionally do not contain the same taxonomic names and/or 
organisms. Therefore you shouldn't read too much into it if assignments made 
with one reference database consistently feature a higher proportion of ASVs 
that remain unassigned. In this case, the idtax-silva was mapped onto the pr2 
nomenclature, so we don't really know whether the lower proportion of assigned 
ASVs is due to lower coverage of protists in silva, or due to a loss of 
information during mapping. We could of course map the pr2 tables onto silva and
compare the information lost in both directions to give a better idea if there 
are systematic differences in assignments made with silva vs. pr2, but this is 
for another day.

We can however see that the when applied to the same database, the RDP 
classifier classified more of our ASVs than the idtaxa classifier. This was 
noted in the idtaxa paper and may be due to high overclassification error rates
associated with the RDP classifier. Again though, we don't want to get too much 
into the weeds here.

Finally, we see that the ensemble is somewhere in the middle of the individual 
methods. This is because it has only assigned taxonomy where an assignment was 
supported by two assignment methods; if those were overclassification errors in 
the bayes-pr2 table, we've reduced the impact of those but it looks like we also 
had enough support to make taxonomy predictions in some places where the 
idtax-pr2 table was unassigned (and perhaps was too conservative?). This hints 
at some of the expected benefits of determining ensemble assignments, though 
further work is needed to validate these expectations.

We'll demonstrate one more comparison you might try. Here, we're comparing 
taxonomic annotations for each ASV in our data set according to each independent 
method with each ASV's ensemble taxonomic assignment. In these comparisons, 
we'll designate four possible outcomes for each ASV's taxonomic assignments: the
assignments predicted by the individual method can be in perfect agreement with 
the ensemble at all ranks; the individual method can assign taxonomy for an ASV 
at more or less ranks than the ensemble; or the taxonomy predicted by the 
individual method can disagree with the ensemble assignment at any rank. Let's 
do it:

```{r, warning=FALSE, message = FALSE}
library("dplyr")
# fcn that compares two taxonomy tables:
tblcomper <- function(y,x) {
  perf <- dplyr::intersect(x, y) # perfectly-matching rows (assignments)

  tmp.x <- dplyr::setdiff(x, perf)
  tmp.y <- dplyr::setdiff(y, perf)
  if (!identical(tmp.x[, 1], tmp.y[, 1])) {
    tmp.x <- ensembleTax::sort_my_taxtab(tmp.x, ranknames = colnames(tmp.x)[3:ncol(tmp.x)])
    tmp.y <- ensembleTax::sort_my_taxtab(tmp.y, ranknames = colnames(tmp.y)[3:ncol(tmp.x)])
  }
  xna <- is.na(tmp.x)
  yna <- is.na(tmp.y)
  # This warning happens 1 million times when there's no NA's. It doesn't matter b/c inf is always bigger so suppressing.
  ## Warning in min(which(z)): no non-missing arguments to min; returning Inf
  x.minna <- apply(xna, MARGIN = 1, function(z) suppressWarnings(min(which(z))))
  y.minna <- apply(yna, MARGIN = 1, function(z) suppressWarnings(min(which(z))))
  x.mo <- which(x.minna > y.minna)
  y.mo <- which(y.minna > x.minna)
  
  yunder.i <- c()
  yover.i <- c()
  mis.i <- c()
  
  # subset where x has more resolved assignments and only to cols where 
  # both have assignments, then see if names match
  yunder <- 0
  ymis <- 0
  if (length(x.mo) > 0){
    for (i in 1:length(x.mo)){
      if ((y.minna[x.mo[i]]-1) < 3){ # if kingdom is unassigned just add to under-classification
        yunder <- yunder+1
      } else {
        tmp.tmpx <- tmp.x[x.mo[i] , 3:(y.minna[x.mo[i]]-1)]
        tmp.tmpy <- tmp.y[x.mo[i] , 3:(y.minna[x.mo[i]]-1)]
    
        if (all(tmp.tmpx == tmp.tmpy)) {
          yunder <- yunder+1
        } else {
          ymis <- ymis+1
        }
      }
    }
  }
  
  # repeat above where y is more resolved than x:
  yover <- 0
  xmis <- 0
  if (length(y.mo) > 0){
    for (i in 1:length(y.mo)){
      if ((x.minna[y.mo[i]]-1) < 3){ # if kingdom is unassigned just add to under-classification
        yover <- yover+1
      } else {
        tmp.tmpx <- tmp.x[y.mo[i] , 3:(x.minna[y.mo[i]]-1)]
        tmp.tmpy <- tmp.y[y.mo[i] , 3:(x.minna[y.mo[i]]-1)]
    
        if (all(tmp.tmpx == tmp.tmpy)) {
          yover <- yover+1
        } else {
          xmis <- xmis+1
        }
      }
    }
  }
  
  if (yover+xmis != length(y.mo) || yunder+ymis != length(x.mo)) {
    stop("somethings wrong i think")
  }
  
  perf <- nrow(perf)
  # whatever's left is where both tables have the same ranks named, but different names. this is a misclassification:
  moremis <- nrow(x) - (xmis+ymis+yover+yunder+perf)
  result <- data.frame(all.match = c(perf), mis = c(ymis+xmis+moremis), over = c(yover), under = c(yunder))
  if (sum(result) == nrow(x)) {
    return(result)
  } else {
    stop("noooooooooooooo")
  }
}

# make a list of all individual taxonomy tables (not the ensemble):
tbl.list <- list(idtax.pr2.pretty, idtax.silva.mapped2pr2, bayes.pr2.pretty)
all.comp <- lapply(tbl.list, FUN = tblcomper, x = eTax1) # apply above comparison fcn to all idnividual tables
all.comp <- base::do.call(base::rbind.data.frame, all.comp) # clean results
# name the rows (order matches the order you put them into the list)
row.names(all.comp) <- c("idtax-pr2", "idtax-silva", "bayes-pr2")
all.comp <- all.comp / rowSums(all.comp) # convert to proportions

# prep data for plotting:
all.comp$tbl <- row.names(all.comp)
plt.all.comp <- melt(all.comp)

# plot the results:
plt2 <- ggplot(plt.all.comp, aes(fill = tbl, x = variable, y = value)) + 
    geom_bar(stat="identity", color = "black", position=position_dodge(width=0.8)) + 
    labs(x = "Relative to ensemble", y = "Proportion of ASVs") + 
    scale_x_discrete(breaks = c("all.match","mis","over","under"), 
                   labels = c("Agree (all ranks)", "Disagree (any rank)", "More ranks assigned","Less ranks assigned")) +
    scale_y_continuous(breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                   limits = c(0, 1), expand = c(0,0)) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12), axis.title.x = element_text(size = 12, face="bold"),
          axis.text.y = element_text(size = 12), axis.title.y = element_text(size = 12, face="bold"),
          panel.background = element_rect(fill = "white",
                                          colour = "white",
                                          linetype = "solid"),
          panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                          colour = "white"),
          panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                          colour = "white"),
          axis.line = element_line(size = 0.5, linetype = "solid", colour = "black"),
          panel.border = element_rect(colour = "black", fill=NA, size=1)) + 
    scale_fill_manual(name = "Taxonomy table",
                      breaks = c("idtax-pr2", "bayes-pr2", "idtax-silva"),
                      values = cbPalette[c(1:3)])
print(plt2)
```

And we have a few things to point out here. First, note that for each taxonomy 
table (bar color), the bars sum to 1. You could thus employ a stacked bar or 
similar, but I prefer this format; see the ggplot2 help for tweaking the plot. 
For interpretation's sake, this just means that each ASV's assignment comparison
must fall into one of the four categories we designated.

We also see that in this very small data set, we had no ASVs for which different 
taxonomic assignments were predicted by the individual methods. There were of 
course instances where one method assigned taxonomy to more or fewer ranks, but
this lack of disagreement is a good thing. Rates of conflicting assignments tend
to be low (< 5-10%) in the larger data sets I've worked with too. This of course
probably depends on the classifier and reference database employed, but is a 
good sign nonetheless.

One last thing to note: under certain implementations of *assign.ensembleTax*, 
assignments predicted by one or more methods can be intentionally prioritized by 
the user. Therefore, it is important to keep in mind that increased 
disagreements between an individual method and an ensemble does not necessarily 
indicate that the individual method is error prone.  

While this small data set doesn't lend itself to many more interesting 
interpretations, hopefully the above comparisons help you get started in 
assessing the utility and optimal parameters for your ensemble taxonomic 
assignments.

This brings us to the end of our vignette. Please see the vignettes specifically
geared toward demonstrating uses for the *taxmapper* and *assign.ensembleTax* 
algorithms for more information, and let us know on the issues page if/when you 
find issues!
