---
title: "Introduction"
output: rmarkdown::html_vignette
date: "`r format(Sys.time(), '%d %B, %Y')`"
author: Mikkel Meyer Andersen
bibliography: refs.bib
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Introduction}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(fig.width = 5, fig.height = 5)
```

# Introduction

This vignette shows how to use the R package `disclapmix` that implements the method described in [@AndersenDisclap2013]. 

**The key function is `disclapmix_adaptive()`**.

For a more gentle introduction to the method, refer to [@AndersenDisclapIntroduction2013].

# Analysis

A Danish reference database [@Hallenberg2005YchromosomeSH] with $n = 185$ observations (male Y-STR haplotypes) at $r=10$ loci is available in the `danes` dataset. 
Let us load the package as well as the data:

```{r}
library(ggplot2)

library(disclapmix)
data(danes)
```

The database is in compact format, i.e. one unique haplotype per row.
To fit the model, we need one observation per row. 
This is done for example like this:

```{r}
db <- as.matrix(danes[rep(seq_len(nrow(danes)), danes$n), seq_len(ncol(danes)-1)])
str(db)
```

Also, note that the database is now an integer matrix.

# Short version

We fit models from 1 cluster up to as many necessary (see `?disclapmix_adaptive` 
for how this is determined):

```{r}
fits <- disclapmix_adaptive(x = db)
```

We can extract how well each model fits the data (judged by the marginal BIC [@BIC]):

```{r}
BICs <- sapply(fits, function(x) x$BIC_marginal)
bestfit <- fits[[which.min(BICs)]]
bestfit
```

We can plot how well each model fits the data:

```{r}
clusters <- seq_along(fits) # or: sapply(fits, function(x) nrow(x$y))
plot(clusters, BICs, type = "b")
points(nrow(bestfit$y), bestfit$BIC_marginal, pch = 4, cex = 4, col = "red")
```

And use the model to predict haplotype frequencies:

```{r}
db_haplotypes <- as.matrix(danes[, seq_len(ncol(db))])
p <- predict(bestfit, db_haplotypes)
plot(log10(danes$n / sum(danes$n)), log10(p), 
     xlab = "Observed relative frequency", 
     ylab = "Discrete Laplace estimated probability")
abline(0, 1)
```


# Longer version

To fit a model using 2 clusters, the following command can be used (note the `L` postfix to emphasize that the number is an integer):

```{r, eval=FALSE}
fit <- disclapmix(x = db, clusters = 2L)
```

The number of clusters is not known beforehand. 
Here, the numbers 1 through 5 are tried and the best one according to the BIC [@BIC] is taken:

```{r}
clusters <- 1L:5L
fits <- lapply(clusters, function(clusters) {
  fit <- disclapmix(x = db, clusters = clusters)
  return(fit)
})

marginalBICs <- sapply(fits, function(fit) {
  fit$BIC_marginal
})

bestfit <- fits[[which.min(marginalBICs)]]
```

The best fit is now in the `bestfit` that can be inspected by `print` (default method called when the variable is just written) or `summary` which (currently) give the same output: 

```{r}
bestfit
summary(bestfit)
```

We can also plot the fitted model:

```{r, fig.width=10}
plot(bestfit)
```

There are important information returned by `disclapmix`, e.g. the central haplotypes and the dispersion parameters for the discrete Laplace distributions:

```{r}
bestfit$y
bestfit$disclap_parameters
```

The returned object is described in `?disclapmix` and objects can be inspected using e.g. `str()`.


Haplotype frequencies can be obtained using the `predict` function. Note, that this is done per haplotype (`danes`) and not per observation (`db`):

```{r}
disclap_estimates <- predict(bestfit, 
                             newdata = as.matrix(danes[, 1:(ncol(danes) - 1)]))
```

These can be compared to the database frequencies:

```{r}
ggplot() +
  geom_abline(intercept = 0, slope = 1) +
  geom_point(aes(x = danes$n/sum(danes$n), y = disclap_estimates)) +
  labs(x = "Observed frequency",
       y = "Predicted frequency (discrete Laplace)") +
  theme_bw() +
  scale_x_log10() +
  scale_y_log10()
```


# References