---
title: "Diagnostic Plots for Fitting Distributions"
author: "Thomas Roh"
date: "December 17, 2017"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Diagnostic Plots for Fitting Distributions}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.height = 5, fig.width = 7)
library(fitur)
library(ggplot2)
```

The `fitur` package includes several tools for visually inspecting how well a
distribution fits a dataset. To start, simulated data is generated below to 
stand in for empirical observations. Typically the data would come from a 
*real-world* source, such as the time it takes to serve a customer at a bank, 
the length of stay in an emergency department, or customer arrivals to a queue.

```{r stats}
set.seed(438)
x <- rweibull(10000, shape = 5, scale = 1)
```

## Histogram

The histogram below shows the shape of the distribution. The y-axis is scaled 
to show probability density rather than counts. 

```{r histPlot}
dt <- data.frame(x)
nbins <- 30
g <- ggplot(dt, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), # needs ggplot2 >= 3.3.0; replaces deprecated ..density..
                bins = nbins, fill = NA, color = "black") +
  theme_bw() +
  theme(panel.grid = element_blank())
g
```
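
To make the density scaling concrete, the sketch below recomputes the bar 
heights by hand in base R as count / (n * bin width) and checks that the bars
enclose a total area of 1. It is an illustration only: `x_demo` is a 
hypothetical stand-in sample, and the equal-width bin edges are an assumption 
that will not match the edges `geom_histogram` chooses.

```{r histDensitySketch}
# Stand-in sample (x_demo is a hypothetical name, separate from x above)
set.seed(438)
x_demo <- rweibull(10000, shape = 5, scale = 1)

# Assumption: 30 equal-width bins spanning the range of the data
breaks <- seq(min(x_demo), max(x_demo), length.out = 31)
h <- hist(x_demo, breaks = breaks, plot = FALSE)

# Density height of each bar = count / (n * bin width)
bin_width <- diff(h$breaks)
manual_density <- h$counts / (length(x_demo) * bin_width)

# The manual heights should agree with the densities hist() reports
all.equal(manual_density, h$density)

# Total area of the bars (should be 1 up to floating-point error)
sum(manual_density * bin_width)
```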

## Histogram vs Density Plot

Three distributions have been chosen below to test against the dataset. Using
the `fit_univariate` function, each distribution is fit to the data and 
returned as a *fitted* object. The first item in each *fit* is the probability 
density function. Each density is overplotted onto the histogram to see which 
distribution fits best.

```{r densPlot}
dists <- c('gamma', 'lnorm', 'weibull')
multipleFits <- lapply(dists, fit_univariate, x = x)
plot_density(x, multipleFits, nbins) + theme_bw() +
  theme(panel.grid = element_blank())
```
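
As a rough numerical companion to the visual comparison, the sketch below 
compares maximized log-likelihoods for two of the candidates using only base R. 
This is not how `fit_univariate` works internally; the `optim` call, the 
log-scale parameterization, and the names `x_demo` and `weibull_nll` are all 
assumptions made for illustration.

```{r loglikSketch}
# Stand-in sample (x_demo is a hypothetical name, separate from x above)
set.seed(438)
x_demo <- rweibull(10000, shape = 5, scale = 1)

# Weibull: minimize the negative log-likelihood numerically. Parameters are
# optimized on the log scale so that shape and scale stay positive.
weibull_nll <- function(par, data) {
  -sum(dweibull(data, shape = exp(par[1]), scale = exp(par[2]), log = TRUE))
}
weibull_opt <- optim(c(0, 0), weibull_nll, data = x_demo)
weibull_par <- exp(weibull_opt$par)      # estimated shape and scale
weibull_loglik <- -weibull_opt$value

# Log-normal: the maximum likelihood estimates have a closed form
meanlog_hat <- mean(log(x_demo))
sdlog_hat <- sqrt(mean((log(x_demo) - meanlog_hat)^2))
lnorm_loglik <- sum(dlnorm(x_demo, meanlog_hat, sdlog_hat, log = TRUE))

# A larger log-likelihood indicates a closer fit to the sample
c(weibull = weibull_loglik, lnorm = lnorm_loglik)
```

Both candidates have two parameters here, so the raw log-likelihoods can be 
compared directly without a penalty for model size.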

## Q-Q Plot

The next plot is the quantile-quantile (Q-Q) plot. The `plot_qq` function takes 
a numeric vector *x* of the empirical data and sorts it. A range of 
probabilities is computed and then passed to the `q` (quantile) function of 
each *fitted* object to obtain comparable theoretical quantiles. A good fit 
closely follows the reference line y = x. Note: the Q-Q plot tends to be more
sensitive in the "tails" of the distribution.

```{r qqplot}
plot_qq(x, multipleFits) +
  theme_bw() +
  theme(panel.grid = element_blank())
```
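
The steps described above can be sketched by hand in base R. This is a minimal 
illustration, not the `plot_qq` implementation: the plotting positions 
`(i - 0.5) / n` are an assumption, `x_demo` is a hypothetical stand-in sample, 
and the true generating parameters are used in place of fitted ones.

```{r qqSketch}
# Stand-in sample (x_demo is a hypothetical name, separate from x above)
set.seed(438)
x_demo <- rweibull(10000, shape = 5, scale = 1)
n <- length(x_demo)

# Empirical quantiles are simply the sorted observations
empirical_q <- sort(x_demo)

# Assumption: plotting positions strictly inside (0, 1)
probs <- (seq_len(n) - 0.5) / n

# Theoretical quantiles from the candidate distribution's q function
theoretical_q <- qweibull(probs, shape = 5, scale = 1)

plot(theoretical_q, empirical_q,
     xlab = "Theoretical quantiles", ylab = "Empirical quantiles")
abline(0, 1, col = "red")  # reference line y = x

# Correlation near 1 means the points track the reference line closely
cor(theoretical_q, empirical_q)
```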

## P-P Plot

The percentile-percentile (P-P) plot rescales the input data to the interval 
(0, 1] and then calculates the theoretical percentiles for comparison. The 
`plot_pp` function takes the same inputs as `plot_qq`, but it rescales *x* and 
then computes the percentiles using the `p` (cumulative distribution) function 
of each *fitted* object. A good fit follows the reference line y = x. Note: the 
P-P plot tends to be more sensitive in the middle of the distribution. 

```{r ppplot}
plot_pp(x, multipleFits) +
  theme_bw() +
  theme(panel.grid = element_blank())
```
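
A hand-rolled version of the same idea in base R is sketched below. It is not 
the `plot_pp` implementation: rescaling by rank as `i / n` is an assumed way of 
mapping the data onto (0, 1], `x_demo` is a hypothetical stand-in sample, and 
the true generating parameters are used in place of fitted ones.

```{r ppSketch}
# Stand-in sample (x_demo is a hypothetical name, separate from x above)
set.seed(438)
x_demo <- rweibull(10000, shape = 5, scale = 1)
n <- length(x_demo)

# Assumption: rescale the sorted data to (0, 1] using rank / n
empirical_p <- seq_len(n) / n

# Theoretical percentiles from the candidate distribution's p function
theoretical_p <- pweibull(sort(x_demo), shape = 5, scale = 1)

plot(theoretical_p, empirical_p,
     xlab = "Theoretical percentiles", ylab = "Empirical percentiles")
abline(0, 1, col = "red")  # reference line y = x

# Largest departure of the points from the reference line
max(abs(theoretical_p - empirical_p))
```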

