\input{share/preamble}

  %\VignetteIndexEntry{Simulation of insurance data}
  %\VignettePackage{actuar}
  %\SweaveUTF8

  \title{Simulation of insurance data with \pkg{actuar}}
  \author{Christophe Dutang \\ Université Paris Dauphine \\[3ex]
    Vincent Goulet \\ Université Laval \\[3ex]
    Mathieu Pigeon \\ Université du Québec à Montréal \\[3ex]
    Louis-Philippe Pouliot \\ Université Laval}
  \date{}

<<echo=FALSE>>=
library(actuar)
options(width = 52, digits = 4)
@

\begin{document}

\maketitle

\section{Introduction}
\label{sec:introduction}

Package \pkg{actuar} provides functions to facilitate the generation
of random variates from various probability models commonly used in
actuarial applications. From the simplest to the most sophisticated,
these functions are:
\begin{enumerate}
\item \code{rmixture} to simulate from discrete mixtures;
\item \code{rcompound} to simulate from compound models (and a
  simplified version, \code{rcompois} to simulate from the very common
  compound Poisson model);
\item \code{rcomphierarc} to simulate from compound models where both
  the frequency and the severity components can have a hierarchical
  structure.
\end{enumerate}


\section{Simulation from discrete mixtures}
\label{sec:rmixture}

A random variable is said to be a discrete mixture of the random
variables with probability density functions $f_1, \dots, f_n$ if its
density can be written as
\begin{equation}
  \label{eq:mixture}
  f(x) = p_1 f_1(x) + \dots + p_n f_n(x) = \sum_{i = 1}^n p_i f_i(x),
\end{equation}
where $p_1, \dots, p_n$ are probabilities (or weights) such that
$p_i \geq 0$ and $\sum_{i = 1}^n p_i = 1$.

Function \code{rmixture} makes it easy to generate random variates
from such mixtures. The arguments of the function are:
\begin{enumerate}
\item \code{n} the number of variates to generate;
\item \code{probs} a vector of values that will be normalized
  internally to create the probabilities $p_1, \dots, p_n$;
\item \code{models} a vector of expressions specifying the simulation
  models corresponding to the densities $f_1, \dots, f_n$.
\end{enumerate}

The specification of simulation models follows the syntax of
\code{rcomphierarc} (explained in greater detail in \autoref{sec:rcomphierarc}). In
a nutshell, the models are expressed in a semi-symbolic fashion using
an object of mode \code{"expression"} where each element is a complete
call to a random number generation function, with the number of
variates omitted.

The following example should clarify this concept.

\begin{example}
  Let $X$ be a mixture between two exponentials: one with mean $1/3$
  and one with mean $1/7$. The first exponential has twice as much
  weight as the second one in the mixture. Therefore, the density of
  $X$ is
  \begin{equation*}
    f(x) = \frac{2}{3} (3 e^{-3x}) + \frac{1}{3} (7 e^{-7x}) \\
    = 2 e^{-3x} + \frac{7}{3} e^{-7x}.
  \end{equation*}
  The following expression generates $10$ random variates from this
  density using \code{rmixture}.
<<echo=TRUE>>=
rmixture(10, probs = c(2, 1),
         models = expression(rexp(3), rexp(7)))
@
  \qed
\end{example}

See also \autoref{ex:comppois} for a more involved application
combining simulation from a mixture and simulation from a compound
Poisson model.


\section{Simulation from compound models}
\label{sec:rcompound}

Actuaries often need to simulate separately the frequency and the
severity of claims for compound models of the form
\begin{equation}
  \label{eq:definition-S}
  S = C_1 + \dots + C_N,
\end{equation}
where $C_1, C_2, \dots$ are the mutually independent and identically
distributed random variables of the claim amounts, each independent of
the frequency random variable $N$.

Function \code{rcompound} generates variates from the random variable
$S$ when the distribution of both random variables $N$ and $C$ is non
hierarchical; for the more general hierarchical case, see
\autoref{sec:rcomphierarc}. The function has three arguments:
\begin{enumerate}
\item \code{n} the number of variates to generate;
\item \code{model.freq} the frequency model (random variable $N$);
\item \code{model.sev} the severity model (random variable $C$).
\end{enumerate}

Arguments \code{model.freq} and \code{model.sev} are simple
R expressions consisting of calls to a random number
generation function with the number of variates omitted. This is of
course similar to argument \code{models} of \code{rmixture}, only with
a slightly simpler syntax since one does not need to wrap the calls in
\code{expression}.

Function \code{rcomppois} is a simplified interface for the common
case where $N$ has a Poisson distribution and, therefore, $S$ is
compound Poisson. In this function, argument \code{model.freq} is
replaced by \code{lambda} that takes the value of the Poisson
parameter.

\begin{example}
  Let $S \sim \text{Compound Poisson}(1.5, F)$, where $1.5$ is the
  value of the Poisson parameter and $F$ is the cumulative
  distribution function of a gamma distribution with shape parameter
  $\alpha = 3$ and rate parameter $\lambda = 2$. We obtain variates
  from the random variable $S$ using \code{rcompound} or
  \code{rcompois} as follows:
<<echo=TRUE>>=
rcompound(10, rpois(1.5), rgamma(3, 2))
rcomppois(10, 1.5, rgamma(3, 2))
@

  Specifying argument \code{SIMPLIFY = FALSE} to either function will
  return not only the variates from $S$, but also the underlying
  variates from the random variables $N$ and $C_1, \dots, C_N$:
<<echo=TRUE>>=
rcomppois(10, 1.5, rgamma(3, 2), SIMPLIFY = FALSE)
@
  \qed
\end{example}

\begin{example}
  \label{ex:comppois}
  Theorem~9.7 of \cite{LossModels4e} states that the sum of compound
  Poisson random variables is itself compound Poisson with Poisson
  parameter equal to the sum of the Poisson parameters and severity
  distribution equal to the mixture of the severity models.

  Let $S = S_1 + S_2 + S_3$, where: %
  $S_1$ is compound Poisson with mean frequency $\lambda = 2$ and
  severity gamma with parameters $(3, 1)$; %
  $S_2$ is compound Poisson with $\lambda = 1$ and severity Gamma with
  parameters $(5, 4)$; %
  $S_3$ is compound Poisson with $\lambda = 1/2$ and severity
  Lognormal with parameters $(2, 1)$. %
  By the aforementioned theorem, $S$ is compound Poisson with
  $\lambda = 2 + 1 + 1/2 = 7/2$ and severity density
  \begin{equation*}
    f(x) = \frac{4}{7}
    \left(
      \frac{1}{\Gamma(3)} x^2 e^{-x}
    \right)
    + \frac{2}{7}
    \left(
      \frac{4^5}{\Gamma(5)} x^4 e^{-4x}
    \right)
    + \frac{1}{7} \phi(\ln x - 2).
  \end{equation*}

  Combining \code{rcomppois} and \code{rmixture} we can generate
  variates of $S$ using the following elegant expression.
<<echo=TRUE>>=
x <- rcomppois(1e5, 3.5,
               rmixture(probs = c(2, 1, 0.5),
                        expression(rgamma(3),
                                   rgamma(5, 4),
                                   rlnorm(2, 1))))
@

  One can verify that the theoretical mean of $S$ is $6 + 5/4 +
  (e^{5/2})/2 = 13.34$. Now, the empirical mean based on the above
  sample of size $10^5$ is:
<<echo=TRUE>>=
mean(x)
@
  \qed
\end{example}


\section{Simulation from compound hierarchical models}
\label{sec:rcomphierarc}

Hierarchical probability models are widely used for data classified in
a tree-like structure and in Bayesian inference. The main
characteristic of such models is to have the probability law at some
level in the classification structure be conditional on the outcome in
previous levels. For example, adopting a bottom to top description of
the model, a simple hierarchical model could be written as
\begin{equation}
  \label{eq:basic_model}
  \begin{split}
    X_t|\Lambda, \Theta
    &\sim \text{Poisson}(\Lambda) \\
    \Lambda|\Theta
    &\sim \text{Gamma}(3, \Theta) \\
    \Theta &\sim \text{Gamma}(2, 2),
  \end{split}
\end{equation}
where $X_t$ represents actual data. The random variables $\Theta$ and
$\Lambda$ are generally seen as uncertainty, or risk, parameters in
the actuarial literature; in the sequel, we refer to them as
mixing parameters.

The example above is merely a multi-level mixture of models, something
that is simple to simulate ``by hand''. The following R
expression will yield $n$ variates of the random variable $X_t$:
<<echo=TRUE, eval=FALSE>>=
rpois(n, rgamma(n, 3, rgamma(n, 2, 2)))
@

However, for categorical data common in actuarial applications there
will usually be many categories --- or \emph{nodes} --- at each level.
Simulation is then complicated by the need to always use the correct
parameters for each variate. Furthermore, one may need to
simulate both the frequency and the severity of claims for compound
models of the form \eqref{eq:definition-S}.

This section briefly describes function \code{rcomphierarc} and its usage.
\cite{Goulet:simpf:2008} discuss in more details the models
supported by the function and give more thorough examples.

\subsection{Description of hierarchical models}
\label{sec:rcomphierarc:description}

We consider simulation of data from hierarchical models. We want a
method to describe these models in R that meets the
following criteria:
\begin{enumerate}
\item simple and intuitive to go from the mathematical formulation of
  the model to the R formulation and back;
\item allows for any number of levels and nodes;
\item at any level, allows for any use of parameters higher in the
  hierarchical structure.
\end{enumerate}

A hierarchical model is completely specified by the number of nodes at
each level and by the probability laws at each level. The number of
nodes is passed to \code{rcomphierarc} by means of a named list where
each element is a vector of the number of nodes at a given level.
Vectors are recycled when the number of nodes is the same throughout a
level.

Probability models are expressed in a semi-symbolic fashion using an
object of mode \code{"expression"}. Each element of the object must be
named --- with names matching those of the number of nodes list ---
and should be a complete call to an existing random number generation
function, but with the number of variates omitted. Hierarchical models
are achieved by replacing one or more parameters of a distribution at
a given level by any combination of the names of the levels above. If
no mixing is to take place at a level, the model for this level can be
\code{NULL}.

\begin{example}
  Consider the following expanded version of model \eqref{eq:basic_model}:
  \begin{align*}
    X_{ijt}|\Lambda_{ij}, \Theta_i
    &\sim \text{Poisson}(\Lambda_{ij}), & t &= 1, \dots, n_{ij} \\
    \Lambda_{ij}|\Theta_i
    &\sim \text{Gamma}(3, \Theta_i), & j &= 1, \dots, J_i \\
    \Theta_i
    &\sim \text{Gamma}(2, 2), & i &= 1, \dots, I,
  \end{align*}
  with
  $I = 3$,
  $J_1 = 4$,
  $J_2 = 5$,
  $J_3 = 6$ and
  $n_{ij} \equiv n = 10$.
  Then the number of nodes and the probability model are respectively
  specified by the following expressions.
  \begin{Schunk}
\begin{Verbatim}
list(Theta = 3, Lambda = c(4, 5, 6), Data = 10)
\end{Verbatim}
  \end{Schunk}
  \begin{Schunk}
\begin{Verbatim}
expression(Theta = rgamma(2, 2),
           Lambda = rgamma(3, Theta),
           Data = rpois(Lambda))
\end{Verbatim}
  \end{Schunk}
  \qed
\end{example}

Storing the probability model requires an expression object in order
to avoid evaluation of the incomplete calls to the random number
generation functions. Function \code{rcomphierarc} builds and executes the
calls to the random generation functions from the top of the
hierarchical model to the bottom. At each level, the function
\begin{enumerate}
\item infers the number of variates to generate from the number of
  nodes list, and
\item appropriately recycles the mixing parameters simulated
  previously.
\end{enumerate}

The actual names in the list and the expression object can be
anything; they merely serve to identify the mixing parameters.
Furthermore, any random generation function can be used. The only
constraint is that the name of the number of variates argument is
\code{n}.

In addition, \code{rcomphierarc} supports usage of weights in models. These
usually modify the frequency parameters to take into account the
``size'' of an entity. Weights are used in simulation wherever the
name \code{weights} appears in a model.

\subsection[Usage of rcomphierarc]{Usage of \code{rcomphierarc}}
\label{sec:rcomphierarc:usage}

Function \code{rcomphierarc} can simulate data for structures where both the
frequency model and the severity model are hierarchical. It has four
main arguments:
\begin{enumerate}
\item \code{nodes} for the number of nodes list;
\item \code{model.freq} for the frequency model;
\item \code{model.sev} for the severity model;
\item \code{weights} for the vector of weights in lexicographic
  order, that is all weights of entity 1, then all weights of
  entity 2, and so on.
\end{enumerate}

The function returns the variates in a list of class
\code{"portfolio"} with a \code{dim} attribute of length two. The list
contains all the individual claim amounts for each entity. Since every
element can be a vector, the object can be seen as a three-dimension
array with a third dimension of potentially varying length. The
function also returns a matrix of integers giving the classification
indexes of each entity in the portfolio.

The package also defines methods for four generic functions to easily
access key quantities for each entity of the simulated portfolio:
\begin{enumerate}
\item a method of \code{aggregate} to compute the aggregate claim
  amounts $S$;
\item a method of \code{frequency} to compute the number of claims
  $N$;
\item a method of \code{severity} (a generic function introduced by
  the package) to return the individual claim amounts $C_j$;
\item a method of \code{weights} to extract the weights matrix.
\end{enumerate}

In addition, all methods have a \code{classification} and a
\code{prefix} argument. When the first is \code{FALSE}, the
classification index columns are omitted from the result. The second
argument overrides the default column name prefix; see the
\code{rcomphierarc.summaries} help page for details.

The following example illustrates these concepts in detail.

\begin{example}
  Consider the following compound hierarchical model:
  \begin{equation*}
    S_{ijt} = C_{ijt1} + \dots + C_{ijt N_{ijt}},
  \end{equation*}
  for
  $i = 1, \dots, I$,
  $j = 1, \dots, J_i$,
  $t = 1, \dots, n_{ij}$
  and with
  \begin{align*}
    N_{ijt}|\Lambda_{ij}, \Phi_i &\sim \text{Poisson}(w_{ijt} \Lambda_{ij}) &
    C_{ijtu}|\Theta_{ij}, \Psi_i &\sim \text{Lognormal}(\Theta_{ij}, 1)
    \notag \\
    \Lambda_{ij}|\Phi_i &\sim \text{Gamma}(\Phi_i, 1) &
    \Theta_{ij}|\Psi_i     &\sim N(\Psi_i, 1) \\
    \Phi_i &\sim \text{Exponential}(2) &
    \Psi_i &\sim N(2, 0.1). \notag
  \end{align*}
  (Note how weights modify the Poisson parameter.) Using as convention
  to number the data level 0, the above is a two-level compound
  hierarchical model.

  Assuming that $I = 2$, $J_1 = 4$, $J_2 = 3$, $n_{11} = \dots =
  n_{14} = 4$ and $n_{21} = n_{22} = n_{23} = 5$ and that weights are
  simply simulated from a uniform distribution on $(0.5, 2.5)$, then
  simulation of a data set with \code{rcomphierarc} is achieved with
  the following expressions.
<<echo=FALSE>>=
set.seed(3)
@
<<echo=TRUE>>=
nodes <- list(cohort = 2,
              contract = c(4, 3),
              year = c(4, 4, 4, 4, 5, 5, 5))
mf <- expression(cohort = rexp(2),
                 contract = rgamma(cohort, 1),
                 year = rpois(weights * contract))
ms <- expression(cohort = rnorm(2, sqrt(0.1)),
                 contract = rnorm(cohort, 1),
                 year = rlnorm(contract, 1))
wijt <- runif(31, 0.5, 2.5)
pf <- rcomphierarc(nodes = nodes,
                   model.freq = mf, model.sev = ms,
                   weights = wijt)
@

  Object \code{pf} is a list of class \code{"portfolio"} containing,
  among other things, the aforementioned two-dimension list as element
  \code{data} and the classification matrix (subscripts $i$ and $j$)
  as element \code{classification}:
<<echo=TRUE>>=
class(pf)
pf$data
pf$classification
@


  The output of \code{pf\$data} is not much readable. If we were to print the
  results of \code{rcomphierarc} this way, many users would wonder
  what \code{Numeric,\emph{n}} means. (It is actually R's
  way to specify that a given element in the list is a numeric vector
  of length $n$ --- the third dimension mentioned above.) To ease
  reading, the \code{print} method for objects of class
  \code{"portfolio"} only prints the simulation model and the number
  of claims in each node:
<<echo=TRUE>>=
pf
@

  By default, the method of \code{aggregate} returns the values of
  $S_{ijt}$ in a regular matrix (subscripts $i$ and $j$ in the rows,
  subscript $t$ in the columns). The method has a \code{by} argument
  to get statistics for other groupings and a \code{FUN}
  argument to get statistics other than the sum:
<<echo=TRUE>>=
aggregate(pf)
aggregate(pf, by = c("cohort", "year"), FUN = mean)
@

  The method of \code{frequency} returns the values of $N_{ijt}$. It
  is mostly a wrapper for the \code{aggregate} method with the default
  \code{sum} statistic replaced by \code{length}. Hence, arguments
  \code{by} and \code{FUN} remain available:
<<echo=TRUE>>=
frequency(pf)
frequency(pf, by = "cohort")
@

  The method of \code{severity} returns the individual
  variates $C_{ijtu}$ in a matrix similar to those above, but with a
  number of columns equal to the maximum number of observations per
  entity,
  \begin{displaymath}
    \max_{i, j} \sum_{t = 1}^{n_{ij}} N_{ijt}.
  \end{displaymath}
  Thus, the original period of observation (subscript $t$) and the
  identifier of the severity within the period (subscript $u$) are
  lost and each variate now constitute a ``period'' of observation. For
  this reason, the method provides an argument \code{splitcol} in case
  one would like to extract separately the individual severities of
  one or more periods:
<<echo=TRUE>>=
severity(pf)
severity(pf, splitcol = 1)
@

  Finally, the weights matrix corresponding to the data in object
  \code{pf} is
<<echo=TRUE>>=
weights(pf)
@

  Combined with the argument \code{classification = FALSE}, the above
  methods can be used to easily compute loss ratios:
<<echo=TRUE>>=
aggregate(pf, classif = FALSE)/
    weights(pf, classif = FALSE)
@
  \qed
\end{example}

\begin{example}
  \cite{Scollnik:2001:MCMC} considers the following model for the
  simulation of claims frequency data in a Markov Chain Monte Carlo
  (MCMC) context:
  \begin{align*}
    S_{it}|\Lambda_i, \alpha, \beta &\sim \text{Poisson}(w_{ij} \Lambda_i) \\
    \Lambda_i|\alpha, \beta         &\sim \text{Gamma}(\alpha, \beta) \\
    \alpha &\sim \text{Gamma}(5, 5) \\
    \beta  &\sim \text{Gamma}(25, 1)
  \end{align*}
  for $i = 1, 2, 3$, $j = 1, \dots, 5$ and with weights $w_{it}$
  simulated from
  \begin{align*}
    w_{it}|a_i, b_i &\sim \text{Gamma}(a_i, b_i) \\
    a_i &\sim U(0, 100) \\
    b_i &\sim U(0, 100).
  \end{align*}
  Strictly speaking, this is not a hierarchical model since the random
  variables $\alpha$ and $\beta$ are parallel rather than nested.
  Nevertheless, with some minor manual intervention, function
  \code{rcomphierarc} can simulate data from this model.

  First, one simulates the weights (in lexicographic order) with
<<echo=FALSE>>=
set.seed(123)
@
<<echo=TRUE>>=
wit <- rgamma(15, rep(runif(3, 0, 100), each = 5),
              rep(runif(3, 0, 100), each = 5))
@

  Second, one calls \code{rcomphierarc} to simulate the frequency data. The key
  here consists in manually inserting the simulation of the shape and rate
  parameters of the gamma distribution in the model for $\Lambda_i$.
  Finally, wrapping the call to \code{rcomphierarc} in \code{frequency} will
  immediately yield the matrix of observations:
<<echo=TRUE>>=
frequency(rcomphierarc(list(entity = 3, year = 5),
            expression(entity = rgamma(rgamma(1, 5, 5),
                                rgamma(1, 25, 1)),
                       year = rpois(weights * entity)),
            weights = wit))
@
  \qed
\end{example}

One will find more examples of \code{rcomphierarc} usage in the
\code{simulation} demo file. The function was used to simulate the
data in \cite{Goulet_cfs}.

%% References
\bibliography{actuar}

\end{document}

%%% Local Variables:
%%% mode: noweb
%%% coding: utf-8
%%% TeX-master: t
%%% End: