---
title: "Genotyping Many SNPs with multidog()"
author: "David Gerard"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genotyping Many SNPs with multidog()}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=4.5,
  fig.height=3.5
)
```

# Abstract

`multidog()` provides support for genotyping many SNPs by iterating `flexdog()` over the SNPs. Support is provided for parallel computing through the [`future`](https://cran.r-project.org/package=future) package. The genotyping method is described in Gerard et al. (2018) and Gerard and Ferrão (2020).

# Fit `multidog()`

Let's load `updog`, `future`, and the data from Uitdewilligen et al. (2013).
```{r setup}
library(future)
library(updog)
data("uitdewilligen")
```

`uitdewilligen$refmat` is a matrix of reference counts while 
`uitdewilligen$sizemat` is a matrix of total read counts. In these data,
the rows index the individuals and the columns index the 
loci. But for insertion into `multidog()` we need it the other way around 
(individuals in the columns and loci in the rows). 
So we will transpose these matrices.
```{r}
refmat  <- t(uitdewilligen$refmat)
sizemat <- t(uitdewilligen$sizemat)
ploidy  <- uitdewilligen$ploidy
```

`sizemat` and `refmat` should have the same row and column names. These names
identify the loci and the individuals.
```{r}
setdiff(colnames(sizemat), colnames(refmat))
setdiff(rownames(sizemat), rownames(refmat))
```

If we want to do parallel computing, we should check that we have the
proper number of cores:
```{r}
parallelly::availableCores()
```

Now let's run `multidog()`:
```{r}
mout <- multidog(refmat = refmat, 
                 sizemat = sizemat, 
                 ploidy = ploidy, 
                 model = "norm",
                 nc = 2)
```

By default, parallelization is run using 
```{r, eval = FALSE}
future::plan(future::multisession, workers = nc)
```
if `nc` is greater than 1. You can choose your own evaluation strategy by running `future::plan()` prior to running `multidog()`, and then setting `nc = NA`. This should be particularly useful in higher performance computing environments that use schedulers, where you can control the evaluation strategy through the [`future.batchtools`](https://cran.r-project.org/package=future.batchtools) package. For example, the following will run `multidog()` using forked R processes:
```{r, eval = FALSE}
future::plan(future::multicore, workers = 2)
mout <- multidog(refmat = refmat, 
                 sizemat = sizemat, 
                 ploidy = ploidy, 
                 model = "norm",
                 nc = NA)

## Shut down parallel workers
future::plan(future::sequential)
```

# `multidog()` Output

There is a plot method for the output of `multidog()`.
```{r}
plot(mout, indices = c(1, 5, 100))
```

The output of multidog contains two data frame. The first contains properties
of the SNPs, such as estimated allele bias and estimated sequencing error rate.

```{r}
str(mout$snpdf)
```

The second data frame contains properties of each individual at each SNP, such as the
estimated genotypes (`geno`) and the posterior probability of being genotyping correctly (`maxpostprob`).

```{r}
str(mout$inddf)
```

You can obtain the columns in `inddf` in matrix form with `format_multidog()`.

```{r}
genomat <- format_multidog(mout, varname = "geno")
head(genomat)
```

To filter SNPs based on quality metrics (bias, sequencing error rate, 
overdispersion, etc), you can use `filter_snp()`, which uses the same 
non-standard evaluation you are used to from `dplyr::filter()`. 
That is, you can define predicates in terms of the variable
names in the `snpdf` data frame from the output of `mupdog()`. It then 
keeps rows in both `snpdf` and `inddf` where the predicate for a SNP
evaluates to `TRUE`.

```{r}
dim(mout$snpdf)
dim(mout$inddf)
mout_cleaned <- filter_snp(mout, prop_mis < 0.05 & bias > exp(-1) & bias < exp(1))
dim(mout_cleaned$snpdf)
dim(mout_cleaned$inddf)
```

# References

Gerard, David, and Luís Felipe Ventorim Ferrão. "Priors for genotyping polyploids." *Bioinformatics* 36, no. 6 (2020): 1795-1800. <https://doi.org/10.1093/bioinformatics/btz852>.

Gerard, David, Luís Felipe Ventorim Ferrão, Antonio Augusto Franco Garcia, and Matthew Stephens. 2018. "Genotyping Polyploids from Messy Sequencing Data." *Genetics* 210 (3). Genetics: 789–807. <https://doi.org/10.1534/genetics.118.301468>.

Uitdewilligen, Anne-Marie A. AND D’hoop, Jan G. A. M. L. AND Wolters. 2013. "A Next-Generation Sequencing Method for Genotyping-by-Sequencing of Highly Heterozygous Autotetraploid Potato." *PLOS ONE* 8 (5). Public Library of Science: 1–14. <https://doi.org/10.1371/journal.pone.0062355>.