---
title: "Gambin overview"
author: Thomas J. Matthews and Colin Gillespie
output: rmarkdown::html_vignette
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{Gambin overview}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

# Overview

The gambin distribution is a sample distribution based on a stochastic model of 
species abundances, and has been demonstrated to fit empirical data better than 
the most commonly used species-abundance distribution (SAD) models (see
@matthews2014gambin and
@ugland2007modelling). Gambin is a stochastic model which combines the gamma 
distribution with a binomial sampling method. To fit the gambin distribution, 
the abundance data is first binned into octaves using a simple log2 transform 
that doubles the number of abundance classes within each octave. Thus, octave 0 
contains the number of species with 1 individual, octave 1 the number of species
with 2 or 3 individuals, octave 2 the number of species with 4 to 7 individuals,
and so forth (method 3 in @gray2006).

The gambin distribution is flexible, meaning it can fit a variety of empirical 
SAD shapes (including lognormal and logseries-like shapes), and that the 
distribution shape (in the context of the unimodal gambin model) is adequately 
characterised by the model’s single parameter (α): low values of alpha indicate 
logserieslike SADs, and high alpha values indicate lognormal-like SADs. As such,
the alpha parameter can be used as a metric to compare the shape of SADs from 
different ecological communities; for example, along an environmental gradient
(e.g @arellano2017)

The expected abundance octave of a species is given by the number of successfull
consecutive Bernoulli trials with a given parameter $p$. The parameter $p$ of 
species is assumed to distributed according to a gamma distribution. This 
approach can be viewed as linking the gamma distribution with the probability of
success in a binomial process with $x$ trials. Use the fit_abundances() function
to fit the gambin model to a vector of species abundances, optionally using a
subsample of the individuals. The package estimates the alpha (shape) parameter
with associated confidence intervals. Methods are provided for plotting the
results, and for calculating the likelihood of fits. The summary() function 
provides the confidence intervals around alpha, and also the results of a
X2 goodness of fit test. Prior to package version 2.4.4, we simply used the
default degrees of freedom in this test (i.e. number of data points - 1).
This is not optimal as the degrees of freedom should arguably also include the
number of parameters used to fit the gambin model itself. As such, in version
2.4.4 we have edited the degrees of freedom to reflect this. One problem is that
the chisq.test() function in R does not have an argument for setting the degrees
of freedom; thus, we have had to use a workaround. As a result of this change,
X2 results generated using older versions of the package will differ slightly
from those using 2.4.4 and later.

It has become increasingly apparent that many empirical SADs are in fact 
multimodal (@antao2017). As such, recent work has focused on expanding the
standard unimodal gambin model to allow it to fit distributions with multiple
modes (@Matthews2019). For example, the bimodal gambin model can be
calculated as the integration of two gambin distributions. The corresponding
likelihood function for the bimodal gambin model contain's four parameters: the
shape parameters for the first and second group, the max octave of the first
group (as this is allowed to vary), and one splitting parameter (split)
representing the fraction of objects in the first group.It is relatively
straightforward to extend the above approach for fitting the bimodal gambin
model by maximum likelihood, to fitting gambin models with g modes. For each
additional mode, a further three parameters are needed: the additional alpha,
max octave and split parameters (see @Matthews2019). Use the
fit_abundances() function in combination with the no_of_components of argument.
The default is no_of_components = 1, which fits the standard unimodal gambin
model. no_of_components = 2, fits the bimodal gambin model, and so on. As the
optimisation procedure takes long with no_of_components > 1, it is possible to
use the cores argument within fit_abundances to make use of parallel processing
in the maximum likelihood optimisation.

The deconstruct_modes() function can then be used to examine a multimodal gambin
model fit. The function provides the location of the modal octaves of each
component distribution and (if species classification data are provided)
determines the proportion of different types of species in each octave.

Often the aim of SAD studies is to compare the form of the SAD across different
sites / samples. The alpha parameter of the one component gambin model (alpha)
has been found to provide a useful metric in this regard. Use the
mult_abundances() function to calculate alpha values for a set of different
samples / sites. However, because the alpha parameter of the gambin model is
dependent on sample size, when comparing the alpha values between sites it can
be useful to first standardise the number of individuals in all sites. By
default, the mult_abundances() function calculates the total number of
individuals in each site and selects the minimum value for standardising. This
minimum number of individuals is then sampled from each site and the gambin
model fitted to this subsample and the alpha value stored. This process is then
repeated N times and the mean alpha value is calculated for each site.


## Example

```{r}
library("gambin")
data(moths, package="gambin")

##unimodal model
fit = fit_abundances(moths)
fit$alpha
barplot(fit)
points(fit)
AIC(fit)

##unimodal model (fit to a subsample of 1000 individuals)
fit2 = fit_abundances(moths, subsample = 1000)
fit2$alpha
barplot(fit2)
points(fit2)
AIC(fit2)

##bimodal model (using 3 cores)

#simulate bimodal gambin distribution
x1 = rgambin(600, 5, 10)
x2 = rgambin(300, 1, 10)
x = table(c(x1,x2))
freq = as.vector(x)
values = as.numeric(as.character(names(x)))
abundances = data.frame(octave=values, species = freq)

#fit bimodal model to simulated data
fit3 = fit_abundances(abundances, no_of_components = 2, cores = 1)
barplot(fit3)
points(fit3)
AIC(fit3)
#compare with AIC of unimodal model
AIC(fit_abundances(abundances))

#fit a bimodal model to a species classification dataset
#and calculate the number of the differet categories in each octave
data(categ, package="gambin")
fits2 = fit_abundances(categ$abundances, no_of_components = 2)
d1 <- deconstruct_modes(fits2, dat = categ, peak_val = NULL, abundances = "abundances", 
     species = "species", categ = "status", col.statu = c("green", "red", "blue"),
     plot_legend = FALSE)
#do the same but don't provide category data - this just highlights the modal octaves
d2 <- deconstruct_modes(fits2, dat = categ, peak_val = NULL, abundances = "abundances", 
     species = "species", categ = NULL)

```

## References