\documentclass[11pt, A4]{article} %\usepackage[brazil]{babel} \usepackage{graphicx} \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{url} \usepackage{Sweave} \usepackage{natbib} \usepackage{framed, color} \usepackage{xspace} \definecolor{shadecolor}{rgb}{0.9, 0.9, 0.9} \setlength{\parindent}{0pt} \setlength{\hoffset}{-0.5in} \setlength{\textwidth}{6in} \setlength{\voffset}{-0.1in} %\pdfpagewidth=\paperwidth %\pdfpageheight=\paperheight \newcommand{\R}{\textnormal{\sffamily\bfseries R}\xspace} % altered bc \sf is obsolete, see: % http://tex.stackexchange.com/questions/74478/latex-command-incantation-for-r \newcommand{\code}[1]{\texttt{#1}} \SweaveOpts{eval=TRUE, keep.source=TRUE, echo=TRUE} %\VignetteIndexEntry{Introduction to sads} \begin{document} \title{Fitting species abundance models with maximum likelihood \\ Quick reference for \code{sads} package} \author{Paulo In\'acio Prado, Murilo Dantas Miranda and Andre Chalom \\ Theoretical Ecology Lab \\ LAGE at the Dep of Ecology, USP, Brazil \\ %\url{http://ecologia.ib.usp.br/let/} \\ \url{prado@ib.usp.br}} \date{\today} \maketitle @ <>= options(width=60, continue=" ") @ %def \section{Introduction} Species abundance distributions (SADs) are one of the basic patterns of ecological communities \citep{McGill2007}. The empirical distributions are traditionally modeled through probability distributions. Hence, the maximum likelihood method can be used to fit and compare competing models for SADs. The package \code{sads} provides functions to fit the most used models to empirical SADs. The resulting objects have methods to evaluate fits and compare competing models. The package also allows the simulation of SADs expected from communities' samples, with and without aggregation of individuals of the same species. \section{Installation} The package is available on CRAN and can be installed in \R with the command: @ <>= install.packages('sads') @ %def then loaded by <>= library(sads) @ %def \subsection{Developer version} \label{sec:developer-version} The current developer version can be installed from GitHub with: @ <>= library(devtools) install_github(repo = 'piLaboratory/sads', ref= 'dev', build_vignettes = TRUE) @ And then load the package: @ <>= library(sads) @ %def \section{Exploratory analyses} \label{sec:analise-exploratoria} Throughout this document we'll use two data sets of abundances from the sads package. For more information on these data please refer to their help pages: @ <>= data(moths)# William's moth data used by Fisher et al (1943) data(ARN82.eB.apr77)# Arntz et al. benthos data data(birds)# Bird census used by Preston (1948) @ %def \subsection{Octaves} \label{sec:oitavas} Function \code{octav} tabulates the number of species in classes of logarithm of abundances at base 2 (Preston's octaves) and returns a data frame \footnote{actually an object of class \emph{octav} which inherits from class \emph{dataframe}}: @ <>= (moths.oc <- octav(moths)) (arn.oc <- octav(ARN82.eB.apr77)) @ %def A logical argument \code{preston} allows smoothing the numbers as proposed by \citet{Preston1948}. The octave number is the upper limit of the class in log2 scale. Hence, for abundance values smaller than one (\emph{e.g.} biomass data) the octave numbers are negative. A Preston plot is a histogram of this table, obtainable by applying the function \code{plot} to the data frame: \setkeys{Gin}{width=0.65\textwidth} {\centering @ <>= plot(moths.oc) @ def } {\centering @ <>= plot(arn.oc) @ %def } The plot method for objects of class \code{octav} has a logical argument \code{prop} that rescales the y-axis to relative frequencies of species in each octave, which can be used to compare different data sets: @ <>= plot(moths.oc, prop = TRUE, border=NA, col=NA) lines(octav(birds), mid = FALSE, prop = TRUE, col="red") lines(octav(moths), mid = FALSE, prop = TRUE) legend("topright", c("Preston's birds", "Fisher's moths"), col=c("red", "blue"), lty=1, bty="n") @ %def \subsection{Rank-abundance plots} \label{sec:rank_abund} Function \code{rad} returns a data frame of sorted abundances and their ranks \footnote{actually an object of class \emph{rad} which inherits from class \emph{dataframe}}: @ <>= head(moths.rad <- rad(moths)) head(arn.rad <- rad(ARN82.eB.apr77)) @ %def To get the rank-abundance or Whitaker's plot apply the function \code{plot} on the data frame: \setkeys{Gin}{width=0.65\textwidth} {\centering % centers BOTH of the next plots @ <>= plot(moths.rad, ylab="Number of individuals") @ %def } {\centering @ <>= plot(arn.rad, ylab="Biomass") @ %def } Again, the plot method for \code{rad} has a logical argument \code{prop} rescales the y-axis to depict relative abundances: @ <>= plot(moths.rad, prop = TRUE, type="n") lines(rad(birds), prop = TRUE, col="red") lines(rad(moths), prop = TRUE) legend("topright", c("Preston's birds", "Fisher's moths"), col=c("red", "blue"), lty=1, bty="n") @ %def \section{Model fitting} \label{sec:ajuste-e-selecao} The \emph{sads} package provides maximum-likelihood fits of many probability distributions to empirical sads. The working horses are the functions \code{fitsad} for fitting species abundance distributions and \code{fitrad} for fitting rank-abundance distributions. The first argument of these functions is the vector of observed abundances \footnote{\code{fitrad} also accepts a rank-abundance table returned by function \code{rad} as its first argument.} The second argument is the name of the model to be fitted. Please refer to the help page of the functions for details on the models. For more information on the fitting procedure see also the vignette of the \emph{bbmle} package, on top of which the package \emph{sads} is built. To fit a log-series distribution use the argument \code{sad='ls'}: @ <>= (moths.ls <- fitsad(moths,'ls')) @ %def The resulting model object inherits from \emph{mle2} \citep{Bolkerbbmle}, and has all usual methods for model objects, such as summaries, log-likelihood, and AIC values: @ <>= summary(moths.ls) coef(moths.ls) logLik(moths.ls) AIC(moths.ls) @ %def On the above examples, notice that the \code{print} method\footnote{Or, equivalently, the \code{show} method} displays some statistics on the input data and fitting function used - number of species, number of individuals, truncation point for the probability distribution (when used, see below) and whether we are fitting a discrete or continuous sad or rad - while the \code{summary} method displays information more associated with the fitting \emph{per se}: standard errors and significance codes for each parameter. Also, notice that the input data is displayed by both methods, but the \code{print} method only shows the first values, as the complete list can be quite large. \subsection{Model diagnostics} \label{sec:model-diagnostics} Many other diagnostic and functions are available for sad and rad models. To get likelihood profiles, likelihood intervals and confidence intervals use: \setkeys{Gin}{width=\textwidth} @ <>= moths.ls.prf <- profile(moths.ls) likelregions(moths.ls.prf) #likelihood intervals confint(moths.ls.prf) @ %def Then use \code{plotprofmle} to plot likelihood profiles at the original scale (relative negative log-likelihood) and function \code{plot} to get plots at chi-square scale (square-root of twice the relative log-likelihood): \setkeys{Gin}{width=\textwidth} @ <>= par(mfrow=c(1,2)) plotprofmle(moths.ls.prf)# log-likelihood profile plot(moths.ls.prf)# z-transformed profile par(mfrow=c(1,1)) @ %def \begin{shaded} \textbf{Likelihood intervals and confindence intervals:} \hfill Likelihood intervals include all values of the parameters that have up to a given log-likelihood absolute difference to the maximum likelihood estimate. This difference is the log-likelihood ratio and is set with the argument \code{ratio} of function \code{likelregions}. The default value of \code{ratio} is log(8), and thus in the example above the likelihood interval encloses all values of the parameter that are up to 8 times as plausible as the estimated value of $\alpha = $ \Sexpr{unname(round(coef(moths.ls)[2],2))}. Likelihood intervals at log(8) converge to the value of confidence intervals at 95\% as sample size increases. In most cases even for moderate sample sizes the limits of confidence and likelihood intervals are very close. Discrepancies occur only when the likelihood profile is highly asymmetric or have local {\em minima}. But in this kind of profile usually indicates an ill-behaved fit, and so the intervals may not be meaningful anyway. \end{shaded} When applied on a sad model object, the function \code{plot} returns four diagnostic plots: @ <>= par(mfrow=c(2,2)) plot(moths.ls) par(mfrow=c(1,1)) @ %def The first two plots (top right and left) are the octave and rank-abundance plots with the predicted values of number of species in each octave and of each species' abundance. The two last plots (bottom) are quantile-quantile and percentile-percentile graphs of the observed vs. predicted abundances. The straight line indicates the expected relation in case of perfect fit. \subsection{SADs vs RADs} Species-abundance models assign a probability for each abundance value. Thus, these models are probability density functions (PDFs) of abundances of species. Rank-abundance models assign a probability for each \textbf{abundance rank}. They are PDFs for rankings of species. The models are interchangeable \citep{May1975}, but currently only four rad models are available in package sads through the argument \code{rad} of function \code{fitrad}: \begin{itemize} \item ``gs'': geometric series (which is NOT geometric PDF, available in \code{fitsad} as ``geom'') \item ``rbs'': broken-stick model \citep{macarthur1957, May1975} \item ``zipf'': Zipf power-law distribution \item ``mand'': Zipf-Mandelbrot power-law distribution \end{itemize} \begin{shaded} \textbf{Comparison to \code{radfit} from \emph{vegan} package:} \hfill fits by \code{fitsad}, \code{fitrad} and \code{radfit} of \emph{vegan} package provide similar estimates of model coefficients but not comparable likelihood values. The reason for this is the fact each function fits models that assign probability values to data in different ways. Function \code{fitsad} fits PDFs to observed abundances and \code{fitrad} fits PDFs to the ranks of the abundances. Finally, \code{radfit} of \emph{vegan} fits a Poisson generalized linear model to the \emph{expected abundances} deduced from rank-abundance relationships from the corresponding sads and rads models \citep{wilson1991}. See also the help page of \code{radfit}. Therefore \textbf{likelihoods obtained from these three functions are not comparable}. \end{shaded} \section{Model selection} It's possible to fit other models to the same data set, such as the Poisson-lognormal and a truncated lognormal: @ <>= (moths.pl <- fitsad(x=moths, sad="poilog"))#default is zero-truncated (moths.ln <- fitsad(x=moths, sad="lnorm", trunc=0.5)) # lognormal truncated at 0.5 @ %def moreover, the function \code{AICtab} and friends from the \emph{bbmle} package can be used to get a model selection table: @ <>= AICtab(moths.ls, moths.pl, moths.ln, base=TRUE) @ %def \begin{shaded} \textbf{Flawed model comparisons} \hfill The information criterion methods do not differentiate between objects of \code{fitsad}, \code{fitrad} or \code{fitsadC} classes. Because of this, it is possible to include \code{fitsad} and \code{fitrad} objects in the same IC-table without generating an error, but the result will be meaningless. \end{shaded} To compare visually fits first get octave tables: @ <>= head(moths.ls.oc <- octavpred(moths.ls)) head(moths.pl.oc <- octavpred(moths.pl)) head(moths.ln.oc <- octavpred(moths.ln)) @ %def then use \code{lines} to superimpose the predicted values on the octave plot: \setkeys{Gin}{width=0.75\textwidth} @ <>= plot(moths.oc) lines(moths.ls.oc, col="blue") lines(moths.pl.oc, col="red") lines(moths.ln.oc, col="green") legend("topright", c("Logseries", "Poisson-lognormal", "Truncated lognormal"), lty=1, col=c("blue","red", "green")) @ %def To do the same with rank-abundance plots get the rank-abundance objects: @ <>= head(moths.ls.rad <- radpred(moths.ls)) head(moths.pl.rad <- radpred(moths.pl)) head(moths.ln.rad <- radpred(moths.ln)) @ %def then plot observed and predicted values: @ <>= plot(moths.rad) lines(moths.ls.rad, col="blue") lines(moths.pl.rad, col="red") lines(moths.ln.rad, col="green") legend("topright", c("Logseries", "Poisson-lognormal", "Truncated lognormal"), lty=1, col=c("blue","red", "green")) @ %def \subsection{Abundance class data} For ecological communities, the representation of species abundances typically occurs through categorization. For instance, when assessing the abundance of sessile organisms, it is common practice to utilize a scale based on the coverage of sampling areas, because direct enumeration of individuals or estimation of biomass are extremely laborious. The package \code{sads} has a specific class for fitting continuous distributions for this kind of data. We will show the use of this class using the data from \citet{Vieira2020}, who provide the coverage class of each plant species in plots set in grasslands in Southern Brazil. The object \code{grasslands} has the data from plot 'CA8', which has the largest number of species recorded in this study. @ <>= head(grasslands) @ %def The vector \code{cover} in this data frame has the cover class for each plant species, that is, the proportion of the area of the plot covered by all individuals of each species. Coverage is expressed in a scale with 13 intervals, as defined by the breakpoints below: @ <>= (grass.brk <- c(0,1,3,5,seq(15,100, by=10),100) ) @ %def We can tally the number of species in each cover class using a histogram: @ <>= grass.h <- hist(grasslands$mids, breaks = grass.brk, plot = FALSE) @ %def The resulting object of the class \code{histogram} has the number of species in each cover class, as well as the classes midpoints: @ <>= data.frame(midpoint = grass.h$mids, N.spp = grass.h$counts) @ %def The function \code{fitsadC} fits continuous distributions usually applied to describe this kind of data. The commands below fit all models currently available for abundance class data in the package \code{sads}. Note that the data is provided to this function through an object of the class \code{histogram}: @ <>= grass.e <- fitsadC(grass.h, 'exp') # Exponential grass.g <- fitsadC(grass.h, 'gamma') # Pareto grass.l <- fitsadC(grass.h, 'lnorm') # Log-normal grass.p <- fitsadC(grass.h, 'pareto') # Pareto grass.w <- fitsadC(grass.h, 'weibull') # Weibull @ %def The fitted models are of class \code{fitsadC}, for which most of the methods for diagnostics and model comparison showed in the previous sections are available. For instance, the fitted models can be compared in the usual way: @ <>= AICctab(grass.e, grass.g, grass.l, grass.p, grass.w, weights = TRUE, base = TRUE) @ %def A histogram with observed values can ploted simply with @ <>= plot(grass.h, main = "", xlab = "Abundance class") @ %def By default, R plots histograms with classes of unequal size using a density scale. The function \code{coverpred} provides the number of species expected by a fited model, in frequency, relative frequency and density scales. @ <>= ## Predicted by each model grass.e.p <- coverpred(grass.e) grass.g.p <- coverpred(grass.g) grass.l.p <- coverpred(grass.l) grass.p.p <- coverpred(grass.p) grass.w.p <- coverpred(grass.w) @ %def We can then use this object to add the densities values predicted by each model: @ <>= ## Plot plot(grass.h, main = "", xlab = "Abundance class", xlim = c(0,40)) ## Adds predicted points points(grass.e.p, col = 1) points(grass.g.p, col = 2) points(grass.l.p, col = 3) points(grass.p.p, col = 4) points(grass.w.p, col = 5) legend("topright", legend = c("Exponential", "Gamma", "Log-normal", "Pareto", "Weibull"), col = 1:5, bty = "n", lty = 1 , pch = 1) @ %def Please refer to the man pages of \code{fitsadC} and \code{fitsaC-class} for further methods and usage. \section{Simulations} The function \code{rsad} returns random samples of a community with $S$ species. The mean abundances of the species in the communities are independent identically distributed (\emph{iid}) variables that follow a given probability distribution. The sample simulates a given number of draws of a fraction $a$ from the total number of individuals in the community. For instance, to simulate two Poisson samples of 10\% of a community with 10 species that follows a lognormal distribution with parameters $\mu=3$ and $\sigma=1.5$ use: @ <>= set.seed(42)# fix random seed to make example reproducible (samp1 <- rsad(S = 10, frac = 0.1, sad = "lnorm", coef=list(meanlog = 3, sdlog = 1.5), zeroes=TRUE, ssize = 2)) @ The function returns a data frame with a sample numeric label, species' numeric label and species' abundance in each sample. By default, \code{rsad} returns a vector of abundances of single Poisson sample with zeroes omitted: @ <>= (samp2 <- rsad(S = 100, frac=0.1, sad="lnorm", list(meanlog=5, sdlog=2))) @ Since this is a Poisson sample of a lognormal community, the abundances in the sample should follow a Poisson-lognormal distribution with parameters $\mu + \log a $ and $\sigma$ \citep{grotan2008}. We can check this by fitting a Poisson-lognormal model to the sample: @ <>= (samp2.pl <- fitsad(samp2, "poilog")) ## checking correspondence of parameter mu coef(samp2.pl)[1] - log(0.1) @ %def Not bad. By repeating the sampling and the fit many times it's possible to evaluate the bias and variance of the maximum likelihood estimates: \setkeys{Gin}{width=\textwidth} @ <>= results <- matrix(nrow=75,ncol=2) for(i in 1:75){ x <- rsad(S = 100, frac=0.1, sad="lnorm", list(meanlog=5, sdlog=2)) y <- fitsad(x, "poilog") results[i,] <- coef(y) } results[,1] <- results[,1]-log(0.1) @ Bias is estimated as the difference between the mean of estimates and the value of parameters: @ <>= ##Mean of estimates apply(results,2,mean) ## relative bias (c(5,2)-apply(results,2,mean))/c(5,2) @ %def And the precision of the estimates are their standard deviations @ <>= ##Mean of estimates apply(results,2,sd) ## relative precision apply(results,2,sd)/apply(results,2,mean) @ Finally, a density plot with lines indicating the mean of estimates and the values of parameters: @ <>= par(mfrow=c(1,2)) plot(density(results[,1]), main=expression(paste("Density of ",mu))) abline(v=c(mean(results[,1]),5), col=2:3) plot(density(results[,2]), main=expression(paste("Density of ",sigma))) abline(v=c(mean(results[,2]), 2), col=2:3) par(mfrow=c(1,1)) @ %def Increasing the number of simulations improves these estimators. \section{Bugs and issues} \label{sec:bugs-issues} The package project is hosted on GitHub (\url{https://github.com/piLaboratory/sads/}). Please report bugs and issues and give us your feedback at \url{https://github.com/piLaboratory/sads/issues}. \bibliographystyle{ecology} \bibliography{sads_intro} \end{document}