---
title: "umiAnalyzer vignette"
author: "Stefan Filges"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: true
    toc_depth: 2
    number_sections: true
    theme: sandstone
    highlight: tango
    code_folding: show
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{umiAnalyzer vignette}
  %\usepackage[utf8]{inputenc}
---

```{css, echo=FALSE}
body .main-container {
  max-width: 1600px !important;
  width: 1280px !important;
}
body {
  max-width: 1600px !important;
}
```

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=9,
  fig.height=6
)
```

# Introduction

umiAnalyzer provides tools for analyzing UMIErrorCorrect output. UMIErrorCorrect
is a Python package for processing targeted high-throughput sequencing data containing
unique molecular identifiers (UMIs). UMIErrorCorrect is available at:
<https://github.com/stahlberggroup/umierrorcorrect>.

## Using the umiVisualizer shiny app

The shiny app can be run using the following command:

```{r runApp, eval=FALSE}
umiAnalyzer::runUmiVisualizer()
```


## How to make your own UMIexperiment object {.tabset}

Define a variable containing the path to the directory with all the UMIErrorCorrect 
output folders belonging to your experiment. umiAnalyzer comes with raw test data 
generated with UMIErrorCorrect that you can import if you don't have any of your own.

Call the createUmiExperiment to create your UMIexperiment object.

The UMIexperiment object always maintains your raw data, however you can create as many filters as you
like, which will be saved as separate objects to access. You can filter the consensus table of
UMIexperiment object with filterUmiObject. The only mandatory arguments are the object to be filtered
and a user defined name. You can use that name to retrieve a filtered table using getFilteredData. 

```{r example1, eval=TRUE}
library(umiAnalyzer)

main <- system.file("extdata", package = "umiAnalyzer")

simsen <- createUmiExperiment(main)

simsen <- mergeAssays(
  object = simsen,
  name = "new",
  assay.list = c("PIK3CA_123", "PIK3CA_234")
)
```

createUmiExperiment has an optional flag importBam, which is set to FALSE by default. 
If this is set to TRUE it will automatically call parseBamFiles and store the
read data in the 'simsen' object which can then be passed directly to plotFamilyHistogram,
without having to run parseBamFiles again. Note that for large experiments, 
especially if consDepth is set lower than 10, the size of the experiment object
may become too large.

```{r bam-files, eval=TRUE}
reads <- parseBamFiles(main, consDepth = 10)

BarcodeFamilyHistogram(reads)
```

Next we generate Quality Control plots and filter the UMIexperiment object to
select for consensus 3 reads. 

```{r example1continued, eval=TRUE}
QCplot(simsen)
```

Most downstream functions use filtered data, each each filtered 
data set receives a name, making it possible to apply multiple filters on the same object as the
original data is maintained and can always be retrieved with saveConsData (see below). Each filtered
data set can be retrieved using getFilteredData and assigned to a new variable or be saved as a csv file
with the optional parameter save = TRUE, which is set to FALSE by 

```{r filerobject, eval=TRUE}
simsen <- filterUmiObject(simsen)
```

Optionally, the beta-binomial variant caller can be run on the data using callVariants.

```{r example1continued_2, eval=TRUE}
# This is optional
simsen <- callVariants(simsen)
```

Retrieve filters using getFilteredData:

```{r getFilter, eval=TRUE}
myfilter <- getFilteredData(simsen)
myfilter
```

Error correction depends on UMI depth, i.e. how many reads per barcode are 
required to form a barcode family. We can plot the number of available reads
for different consensus depths.

```{r example1continued_3, eval=TRUE}
UmiCountsPlot(simsen)
```

Next we generate plots for the amplicons and samples in the UMIexperiment object.

If generateAmpliconPlots is called and the number of amplicons and samples is too
large to plot all of them individually in a single plot the data is shown in
summarized form. The user can specify amplicons and or samples optionally to
plot only the selection.

### All data

```{r, eval=FALSE}
AmpliconPlot(simsen)
```

### Selection

```{r, eval=FALSE}
AmpliconPlot(
  object = simsen,
  amplicons = 'KIT_125',
  samples = 'VAF-1-5ng-1-10x'
)
```

## Merging technical replicates

In the above analyses potential replicates are treated as individual samples, but
it is possible to merge replicate information for statistical analyses and
more convenient plotting.

Merging data requires the user to supply a file with meta data using the 
importDesign function. This will create a metadata attribute called "design",
which can be retrieved using getMetaData.

```{r metaDataImport, eval=FALSE}
metaData <- system.file("extdata", "metadata.txt", package = "umiAnalyzer")

simsen <- importDesign(
  object = simsen,
  file = metaData
)

design <- getMetaData(
  object = simsen, 
  attributeName = "design"
)

design
```

## Time series data

We can generate time course data using the analyzeTimeSeries function. This
requires some information from the meta data that was uploaded above. Most
importantly a time variable needs to be available in the meta data object called
"time" by default. However the variable can have any name, which can be specified
using the time.var parameter. If time.var has a date format that will be used 
for plotting, otherwise it will convert time.var to a categorical. If you have
time.var in date format but want the plot to use categorical instead set the
categorical parameter to TRUE. 

We can also make use of replicate data to remove background noise. If you specify
a group.by variable from the meta data only variants that occur at least twice
per group will be used. If we specify 'replicate' that means only variants in
at least 2 out of 3 replicates will be considered. The default value for
group.by is NULL and each sample will be analyzed and plotted independently.

```{r time_course, eval=FALSE}
simsen <- umiAnalyzer::analyzeTimeSeries(
  object = simsen,
  time.var = 'time',
  group.by = 'replicate'
)
```

We can now use the mergeTechnicalReplicates function to create a summarized
data table. Merging replicates also performs coverage normalization. This can 
remove bias incurred when comparing samples sequenced at different read depths,
because background noise is expected to be higher in a sample that was sequenced
to a greater depth. A drawback is that it is not possible to scale a count of
zero variant reads one can set all 0 to a small value, such as 0.5, first and then
scale the data.

```{r replicates, eval=FALSE}
simsen <- mergeTechnicalReplicates(
  object = simsen,
  group.by = 'replicate',
  option = 'Set1', 
  amplicons = c('new', 'TP53_1', 'TP53_7')
)
```

We can also have a look at the merged amplicon plots which now show the average
maximum alternate allele count and standard deviation.

```{r vizMerge, eval=FALSE}
vizMergedData(
  simsen,
  amplicons = c('new')
)
```

# Calling variants

umiAnalyzer comes with a build-in UMIexperiment object to explore, which was generated using the code 
above, so it can be used without creating the it first if so desired.

In order to call variants using the umiAnalyzer variant caller simply load the package and test data
and use the callVariants function. You can then filter the resulting consensus data (cons.data) within
the object, e.g. for significant variants.

```{r example2, eval=FALSE}
simsen <- callVariants(simsen)

simsen <- filterVariants(simsen)

simsen <- generateAmpliconPlots(
  object = simsen
)
```


# Other functions (experimental)

## Writing a Variant Call File (VCF)

```{r vcf, eval=FALSE}
generateVCF(object = simsen, outFile = 'simsen.vcf', printAll = FALSE, save = FALSE)
```

## Saving consensus data as a csv file

It is possible to retrieve consensus data as a tibble using saveConsData. This function can also be used to to save the data
as a csv file with delimiters either ',', ';' or tab. This requires the user to set the save parameter to TRUE and to specify an
output directory (default is the working directory), filename (default is 'consensus_data.csv') and delimiter (default is ';').

```{r csv, eval=FALSE}
consensus_data <- saveConsData(object = simsen)

outDir <- getwd()
saveConsData(object = simsen, outDir = outDir, delim = ';', save = TRUE)
```

# System info

```{r sessionInfo}
sessionInfo()
```