---
title: "whoa Tutorial"
author: "Eric C. Anderson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{whoa Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


This is a small, lightweight package that lets users investigate the
distribution of genotypes in genotype-by-sequencing (GBS) data where they expect (by and large)
Hardy-Weinberg equilibrium, in order to assess rates of genotyping errors and
the dependence of those rates on read depth.

The name comes from the bolded letters in this sentence:

**W**here's my **H**eterozygotes at?  **O**bservations on genotyping **A**ccuracy.

It also fits well with my reaction when I started investigating 
heterozygote miscall rates (rates at which true heterozygotes are
incorrectly called as homozygotes) in some restriction associated digest (RAD)
sequencing data sets---My eyes
bugged out and I said, "Whoa!"

The package comes with a small bit of data from lobster to play with.  The rest of 
this document shows a quick run through a few of the functions to do an 
analysis of a data set.


## A quick run 

### Packages

```{r}
# load up the package:
library(whoa)

```

### Lobster data

Read about the lobster data here. Execute this if you want:
```{r, eval=FALSE}
help("lobster_buz_2000")
```
The main thing to know is that it is a 'vcfR' object.  You can 
make such an object yourself by reading in a variant call format (VCF) file 
using `vcfR::read.vcfR()`.

### Make a quick genotype frequency scatter plot

```{r, fig.width=7}
# first get compute expected and observed genotype frequencies
gfreqs <- exp_and_obs_geno_freqs(lobster_buz_2000)

# then plot those.  Set max_plot_loci so that all 2000
# loci will be plotted
geno_freqs_scatter(gfreqs, max_plot_loci = 2000)
```

### Now infer an overall heterozygote miscall rate.

If we want to estimate the het miscall rate (over all read depth bins)
we just set the minimum bin size to a very large value so it make just one bin, and
use the `infer_m` function which implements a Markov chain Monte Carlo (MCMC) algorithm
to estimate the heteryzogote miscall rate.
```{r}
overall <- infer_m(lobster_buz_2000, minBin = 1e15)
```
Now look at that:
```{r}
overall$m_posteriors
```

Wow! (Or should we say "WHOA!") A het miscall rate of around 25%.

### Now infer a miscall rate for read depth bins

See the total_n above is about 65,000.  That means 65,000 genotypes. 
(2000 loci typed at 36 individuals, with some missing data).  
We will bin those up so that there are at least 2000 genotypes in each 
bin and then estimate the het miscall rate for each read depth bin.

```{r}
binned <- infer_m(lobster_buz_2000, minBin = 2000)
```

And then we can plot the posterior mean and CIs for each read depth bin.
```{r, fig.width=7}
posteriors_plot(binned$m_posteriors)
```

Again, WHOA!  The het miscall rate at low read depths is super high!


