---
title: "taxmapper tutorial"
author: "D Catlett"
date: "5/18/2021"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{taxmapper tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

This vignette provides detailed examples to demonstrate the functionality of the
*taxmapper* algorithm included with the ensembleTax package. For a more general 
demonstration of the ensembleTax package functionality/workflow, please go here:
https://github.com/dcat4/ensembleTax

## The taxmapper algorithm
*taxmapper*'s purpose is to map a collection of taxonomic assignments onto a 
different taxonomic nomenclature (set of naming and ranking conventions). 

It does this via rank-agnostic exact name matching. In other words, *taxmapper* 
doesn't care about the heirarchical structure of a taxonomic nomenclature, and 
assumes that a taxonomic name means the same thing regardless of which reference 
database that name is found in. There are some exceptions to this when ambiguous
names are encountered; see Example 5 below for details on what constitutes an 
ambiguous name and how these are handled by ensembleTax.

### Examples
To demonstrate the functionality of *taxmapper*, we'll create an artificial set 
of ASVs and corresponding taxonomic assignments as well as an artificial 
taxonomic nomenclature that mimic's those available in the ensembleTax R 
package.

So first, load the ensembleTax package, and create the artificial data sets:

```{r}
library("ensembleTax")
packageVersion("ensembleTax")

# create a fake taxonomy table of ASVs and taxonomic assignments
fake.taxtab <- data.frame(ASV = c("CGTC", "AAAA"),
                          kingdom = c("Eukaryota", "Bacteria"), 
                          supergroup = c("Stramenopile", NA),
                          division = c("Ochrophyta", NA),
                          class = c("Diatomea", NA),
                          genus = c("Pseudo-nitzschia", NA))
# create a fake taxonomic nomenclature:
map2me <- data.frame(kingdom = c("Eukaryota"), 
                     largegroup = c("Stramenopile"),
                     division = c("Clade_X"),
                     class = c("Ochrophyta"),
                     order = c("Bacillariophyta"),
                     genus = c("Pseudonitzschia"))
# look at your artificial data:
fake.taxtab
map2me
```

So we see we have a set of 2 ASVs with taxonomic assignments, and a taxonomic 
nomenclature that contains 1 taxonomic entry (we're trying to make a simple 
example here; you'll have thousands of entries in each if you're doing this with
real data).

Now would be a good time to review the *taxmapper* documentation to get a sense 
of the different parameter spaces available. Here we'll try to demonstrate what 
these different parameters are doing.

#### Example 1: Strict exact name-matching and the "streamline" argument

To start, we'll run *taxmapper* with no exceptions, no format-ignoring, and no 
taxonomic synonyms, and we'll look at the different outputs you can expect based
on the *streamline* argument:

```{r}
mapped.tt.stmlin <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = NULL,
                       ignore.format = FALSE,
                       synonym.file = NULL,
                       streamline = TRUE)
mapped.tt.stmlin

mapped.tt.no.stmlin <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = NULL,
                       ignore.format = FALSE,
                       synonym.file = NULL,
                       streamline = FALSE)
mapped.tt.no.stmlin
```

We see that when *streamline = TRUE* we return a dataframe with the input ASV's 
and their mapped taxonomic assignments. This is intended for users who want to 
automate their ensembleTax workflow and move on with further analyses right 
away.

If you want to take a look "under the hood", setting *streamline = FALSE*, 
returns a 3-element list. [[1]] shows the input taxonomic assignments aligned 
with their mapped values (a sort of mapping "rubric"). Because *Bacteria* was 
not found in tax2map2, it does not have a taxonomy to map onto and is not 
included in the "rubric". [[2]] shows the taxonomic names that could not be 
mapped. We see that these are the names that were not found (or did not have exact matches to any name) in *tax2map2* (or "map2me" in this example). Finally, [[3]] contains the mapped input taxonomy table, which is identical to what was 
returned when *streamline = TRUE*.

We see that the "CGTC" ASV was mapped to *Ochrophyta*, despite the use of 
different ranking conventions in the input taxonomy table and the taxonomic 
nomenclature we're mapping onto. This illustrates the "rank-agnostic" part of 
*taxmapper*. The "AAAA" ASV is entirely unassigned in the mapped output because
our *tax2map2* didn't include *Bacteria*. If you'd like to retain a high-level 
taxonomic assignment like *Bacteria* in this example, you can address that with 
the *exceptions* argument.

#### Example 2: The "exceptions" argument

Here we'll specify that we want to keep *Bacteria* assignments even though 
they aren't included in *tax2map2*:

```{r}
mapped.tt.exc <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = c("Bacteria"),
                       ignore.format = FALSE,
                       synonym.file = NULL,
                       streamline = TRUE)
mapped.tt.exc
```

And we see that instead of a completely unassigned "AAAA" ASV as we had above, 
we've now retained the *Bacteria* assignment in the mapped output. 

#### Example 3: Incorporating taxonomic synonyms

Folks who study phytoplankton might recognize that *Diatomea* in our input 
taxonomy table and *Bacillariophyta* in the nomenclature we're mapping onto are 
taxonomic synonyms (both refer to the same class of phytoplankton, diatoms). 
*taxmapper* can search for taxonomic synonyms as well.

If you'd like to use a custom compilation of taxonomic synonyms, please see this
vignette: 
https://github.com/dcat4/ensembleTax/blob/master/how_to_add_synonyms.md.

ensembleTax includes a collection of pre-compiled eukaryotic taxonomic synonyms.
Let's have a look at whether *Diatomea* and *Bacillariophyta* are included in 
this pre-compiled data set:

```{r}
# load ensembleTax's pre-compiled synonyms:
syn.df <- ensembleTax::synonyms_v2
# pull rows with Diatomea (there's only 1)
diatom.synonyms <- syn.df[which(syn.df == "Diatomea", arr.ind=TRUE)[,'row'],]
# look at it:
diatom.synonyms
```

They are. You can follow a similar procedure to check for synonyms for your 
favorite taxonomic name, or enhance our pre-compiled synonym collection by 
saving the above syn.df dataframe to a csv and adding in your own collections of
synonyms.

Moving on, if we tell *taxmapper* to consult the pre-compiled taxonomic 
synonyms included with the ensembleTax package, we should be able to get more 
refined mapped taxonomic assignments in this example. We'll do this here with 
the *synonym.file = "default"* argument:

```{r}
mapped.tt.syn <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = c("Bacteria"),
                       ignore.format = FALSE,
                       synonym.file = "default",
                       streamline = TRUE)
mapped.tt.syn
```

Taking a look at this output, we see that the "CGTC" ASV has now been mapped to 
*Bacillariophyta*, despite the fact that it is called *Diatomea* in the fake 
reference database we used to generate our fake taxonomic assignments. So our 
inclusion of taxonomic synonyms has reduced the information lost in taxonomy 
mapping.

We have just one more parameter to check out...

#### Example 4: The "ignore.format" argument

You might have noticed in the examples above that our input taxonomy table 
includes an ASV assigned as *Pseudo-nitzschia*, while the nomenclature we're 
mapping to includes the same taxonomic name with no hyphen in the middle. This 
is where the ignore.format argument can be helpful: 

```{r}
mapped.tt.igfo <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = c("Bacteria"),
                       ignore.format = TRUE,
                       synonym.file = NULL,
                       streamline = TRUE)
mapped.tt.igfo
```

We see that setting *ignore.format = TRUE* has circumvented the formatting 
issue, and now we retain more information in our mapped annotations since we're 
able to map *Pseudo-nitzschia* onto *Pseudonitzschia*. Other special symbols 
handled with *ignore.format = TRUE* include " " (single space), "_", "-", "[", 
"]". It also reduces case sensitivity (attempts to map all-lower- and all-upper-
case variants of a taxonomic name). 

If you read the *ignore.format* documentation carefully, you may notice there 
are other circumstances where the *ignore.format* option doesn't work as 
cleanly. Here we'll show an example to illustrate. 

If the special characters *ignore.format* handles are found in *tax2map2*
rather than *tt*, the mapping won't work. We'll make a second fake.taxtab and 
map2me with the *Pseudonitzschia* variants swapped to demonstrate:

```{r}
fake.taxtab2 <- fake.taxtab
fake.taxtab2[fake.taxtab2 == "Pseudo-nitzschia"] <- "Pseudonitzschia"
map2me2 <- map2me
map2me2[map2me2 == "Pseudonitzschia"] <- "Pseudo-nitzschia"

fake.taxtab2
map2me2

mapped.tt.igfo2 <- taxmapper(tt = fake.taxtab2,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me2,
                       exceptions = c("Bacteria"),
                       ignore.format = TRUE,
                       synonym.file = NULL,
                       streamline = TRUE)
mapped.tt.igfo2
```

This example illustrates that formatting is only being ignored for the taxonomic 
names we're mapping, and NOT for the taxonomic nomenclature we're mapping onto. 
This is an important limitation to keep in mind. If you find this problematic, 
you may consider further customization of the *tax2map2* data. We are 
considering more detailed manipulations of the nomenclatures supported by 
ensembleTax to circumvent this issue but for now we supply these exactly as they
are supplied by the creators of the reference databases.

#### Example 5: ambiguous "placeholder" names

One last example we need to look at considers ambiguous taxonomic names that are
sometimes included in reference databases. 

Let's make a small adjustment to our fake.taxtab to see how these are handled by 
*taxmapper*. We'll add a "Clade_X" supergroup annotation to our prokaryotic ASV.

```{r}
# create a new fake taxonomy table of ASVs and taxonomic assignments
fake.taxtab <- data.frame(ASV = c("CGTC", "AAAA"),
                          kingdom = c("Eukaryota", "Bacteria"), 
                          supergroup = c("Stramenopile", "Clade_X"),
                          division = c("Ochrophyta", NA),
                          class = c("Diatomea", NA),
                          genus = c("Pseudo-nitzschia", NA))

# look at your artificial data again:
fake.taxtab
map2me
```

Re-inspecting map2me shows that "Clade_X" is also the name of a clade of 
Eukaryotic Stramenopiles. Ruh roh. This might introduce errors in the mapped 
taxonomic assignments since Clade_X is a name found in both a Bacterial and 
Stramenopile lineage.

Let's see what happens when we run taxmapper:

```{r}
mapped.tt.ambigtest <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = NULL,
                       ignore.format = TRUE,
                       synonym.file = NULL,
                       streamline = TRUE)
mapped.tt.ambigtest
```

We see that despite the fact that there was an exact name match, *taxmapper* has 
avoided making an incorrect annotation in the mapped output. *taxmapper* 
does this by checking the names to be mapped for taxonomic names that BEGIN with
certain words. Here's the complete list of what it checks for:
"Clade", "CLADE", "clade", "Group", "GROUP", "group", "Class", "CLASS", "class",
"Subgroup", "SubGroup", "SUBGROUP", "subgroup", "Subclade", "SubClade", 
"SUBCLADE", "subclade", "Subclass", "SubClass", "SUBCLASS", "subclass", 
"Sub group", "Sub Group", "SUB GROUP", "sub group", "Sub clade", "Sub Clade", 
"SUB CLADE", "sub clade", "Sub class", "Sub Class", "SUB CLASS", "sub class",
"Sub_group", "Sub_Group", "SUB_GROUP", "sub_group", "Sub_clade", "Sub_Clade", 
"SUB_CLADE", "sub_clade", "Sub_class", "Sub_Class", "SUB_CLASS", "sub_class", 
"Sub-group", "Sub-Group", "SUB-GROUP", "sub-group", "Sub-clade", "Sub-Clade", 
"SUB-CLADE", "sub-clade", "Sub-class", "Sub-Class", "SUB-CLASS", "sub-class", 
"incertae sedis", "INCERTAE SEDIS", "Incertae sedis", "Incertae Sedis", 
"incertae-sedis", "INCERTAE-SEDIS", "Incertae-sedis", "Incertae-Sedis", 
"incertae_sedis", "INCERTAE_-SEDIS", "Incertae_sedis", "Incertae_Sedis", 
"incertaesedis", "INCERTAESEDIS", "Incertaesedis", "IncertaeSedis", 
"unclassified", "UNCLASSIFIED", "Unclassified", "Novel", "novel", "NOVEL", 
"sp", "sp.", "spp", "spp.", "lineage", "Lineage", "LINEAGE"

So, what does *taxmapper* do when it encounters an ambiguous name like 
"Clade_X"? It doesn't just discard the name. Instead, it finds the lowest rank 
with a non-ambiguous taxonomic name (a name that doesn't begin with a word in 
the list above), and appends that non-ambiguous name to the ambiguous name, 
separated by a "-". In our example above, this means *taxmapper* was searching 
for "Bacteria-Clade_X" rather than just "Clade_X", removing the ambiguity in 
taxonomic identity. 

Here we'll add an annotation to our *tax2map2* (the map2me variable defined 
above) and see that, in some cases, we can use *ignore.format* to map the 
ambiguous "Clade_X" name assigned to our "AAAA" ASV:

```{r}
# add an entry in our tax2map2 that matches (but not exactly) one of our ASVs:
map2me <- rbind(map2me,
                c("Bacteria", "Bacteria_Clade_X", rep(NA, times = ncol(map2me)-2)))
map2me

# map again with ignore.format = FALSE.. the Bacteria will only map to Bacteria
mapped.tt.ambigtest2 <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = NULL,
                       ignore.format = FALSE,
                       synonym.file = NULL,
                       streamline = TRUE)
# confirm:
mapped.tt.ambigtest2

# now set ignore.format = TRUE.. we'll map to Bacteria Clade X:
mapped.tt.ambigtest3 <- taxmapper(tt = fake.taxtab,
                       tt.ranks = colnames(fake.taxtab)[2:ncol(fake.taxtab)],
                       tax2map2 = map2me,
                       exceptions = NULL,
                       ignore.format = TRUE,
                       synonym.file = NULL,
                       streamline = TRUE)
# confirm:
mapped.tt.ambigtest3

```

To clarify what's going on here one last time, when *taxmapper* encountered the
"Clade_X" assignment for the "AAAA" ASV, it appended the next-lowest 
non-ambiguous taxonomic assignment ("Bacteria") and searched for an 
exact match to this now-non-ambiguous name ("Bacteria-Clade_X"). When 
*ignore.format = FALSE*, "Bacteria-Clade_X" was not an exact match to 
"Bacteria_Clade_X" (the hyphen and underscore are different). But when 
*ignore.format = TRUE*, *taxmapper* searched for various formatting variants of 
"Bacteria-Clade_X", one of which is "Bacteria_Clade_X". This results in an exact 
match in *tax2map2* and a more refined mapped taxonomic annotation for this ASV.

You might notice that if an ambiguous name like "Clade_X" is found in 
*tax2map2*, we will NOT be able to map onto this taxonomic assignment under any 
circumstances with the current implementation of *taxmapper*. The strategy 
*taxmapper* uses here is based on inspection of the database nomenclatures 
included in ensembleTax and our desire to preserve the nomenclatures employed by 
different reference databases as closely as possible. Again, we are considering 
more detailed manipulations of the nomenclatures supported by ensembleTax to 
circumvent this issue but for now we supply these as they are supplied by the 
creators of the reference databases.

And that brings us to the end of this vignette. Please let us know about issues 
that come up on the esembleTax Github issues tracker.