---
title: "Simulate Next-Generation Sequencing Data"
author: "David Gerard"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulate Next-Generation Sequencing Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=4.5, 
  fig.height=3.5
)
```

# Abstract

We demonstrate how to simulate NGS data under various genotype distributions, then fit these data using `flexdog`. The genotyping methods are described in Gerard et al. (2018).

# Analysis

Let's suppose that we have 100 hexaploid individuals, with varying levels of read-depth.
```{r, message=FALSE}
set.seed(1)
library(updog)
nind    <- 100
ploidy  <- 6
sizevec <- round(stats::runif(n   = nind, 
                              min = 50, 
                              max = 200))
```

We can simulate their read-counts under various genotype distributions, allele biases, overdispersions, and sequencing error rates using the `rgeno` and `rflexdog` functions.

## F1 Population

Suppose these individuals are all siblings where the first parent has 4 copies of the reference allele and the second parent has 5 copies of the reference allele. Then the following code, using `rgeno`, will simulate the individuals' genotypes.
```{r}
true_geno <- rgeno(n      = nind, 
                   ploidy = ploidy,
                   model  = "f1", 
                   p1geno = 4,
                   p2geno = 5)
```

Once we have their genotypes, we can simulate their read-counts using `rflexdog`. Let's suppose that there is a moderate level of allelic bias (0.7) and a small level of overdispersion (0.005). Generally, in the real data that I've seen, the bias will range between 0.5 and 2 and the overdispersion will range between 0 and 0.02, with only a few extremely overdispersed SNPs above 0.02.

```{r}
refvec <- rflexdog(sizevec = sizevec, 
                   geno    = true_geno, 
                   ploidy  = ploidy, 
                   seq     = 0.001, 
                   bias    = 0.7, 
                   od      = 0.005)
```

When we plot the data, it looks realistic
```{r}
plot_geno(refvec  = refvec, 
          sizevec = sizevec, 
          ploidy  = ploidy, 
          bias    = 0.7, 
          seq     = 0.001,
          geno    = true_geno)
```

We can test `flexdog` on these data
```{r}
fout <- flexdog(refvec  = refvec, 
                sizevec = sizevec, 
                ploidy  = ploidy, 
                model   = "f1")
```

`flexdog` gives us reasonable genotyping, and it accurately estimates the proportion of individuals mis-genotyped.
```{r}
plot(fout)

## Estimated proportion misgenotyped
fout$prop_mis

## Actual proportion misgenotyped
mean(fout$geno != true_geno)
```

## HWE Population

Now run the same simulations assuming the individuals are in Hardy-Weinberg population with an allele frequency of 0.75.

```{r}
true_geno <- rgeno(n      = nind,
                   ploidy = ploidy,
                   model  = "hw", 
                   allele_freq = 0.75)
refvec <- rflexdog(sizevec = sizevec, 
                   geno    = true_geno, 
                   ploidy  = ploidy, 
                   seq     = 0.001, 
                   bias    = 0.7, 
                   od      = 0.005)
fout <- flexdog(refvec  = refvec, 
                sizevec = sizevec, 
                ploidy  = ploidy, 
                model   = "hw")
plot(fout)

## Estimated proportion misgenotyped
fout$prop_mis

## Actual proportion misgenotyped
mean(fout$geno != true_geno)

## Estimated allele frequency close to true allele frequency
fout$par$alpha
```

# References

Gerard, David, Luís Felipe Ventorim Ferrão, Antonio Augusto Franco Garcia, and Matthew Stephens. 2018. "Genotyping Polyploids from Messy Sequencing Data." *Genetics* 210 (3). Genetics: 789–807. <https://doi.org/10.1534/genetics.118.301468>.