\documentclass[nojss]{jss}

%\VignetteIndexEntry{imputeMulti: An Introduction}
  %\VignetteAuthor{Alex Whitworth}
  %\VignetteKeywords{R, introduction}
  %\VignetteTangle{no}
  %\VignetteEngine{R.rsp::rsp}

% \usepackage{graphicx, color, hyperref, ae, fancyverb, natbib} % -- default JSS packages
\usepackage{amsmath, amssymb, amsthm} % American Math. Society Packages
\usepackage{thumbpdf} % JSS encouraged
\usepackage{float} 
\usepackage[font= footnotesize, labelfont= bf]{caption}
\usepackage{placeins}

% from Achim Zeileis -- email correspondance 6/25/2016
\newcommand{\Sconcordance}[1]{%
  \ifx\pdfoutput\undefined%
  \csname newcount\endcsname\pdfoutput\fi%
  \ifcase\pdfoutput\special{#1}%
  \else%
   \begingroup%
     \pdfcompresslevel=0%
     \immediate\pdfobj stream{#1}%
     \pdfcatalog{/SweaveConcordance \the\pdflastobj\space 0 R}%
   \endgroup%
  \fi}

%% need no \usepackage{Sweave.sty}
%------------------------------------------------------------
% title information
\author{Alex Whitworth}
\Plainauthor{Alex Whitworth}
\title{imputeMulti: Imputation for Multivariate Multinomial Missing Data}
\Shorttitle{imputeMulti}

\Abstract{
  \pkg{imputeMulti} is an \proglang{R} package for imputation of multivariate multinomial missing data 
  via expectation-maximization and data augmentation algorithms. The package allows the 
  specification of Bayesian priors, including data-dependent priors. For performance, 
  calculation of the summary statistics of the multinomial distribution is implemented in 
  \proglang{C++}; \pkg{imputeMulti} also supports these calculations in parallel. 
  As a result, the \pkg{imputeMulti} package capably handles large datasets. In this article, 
  we introduce the package's functionality and provide a hands-on approach to solving multinomial 
  missing data problems.
}
\Keywords{multinomial, missing data, imputation, expectation-maximization, data-augmentation, EM, DA, R}

\Address{
  Alex Whitworth \\
  Seattle, WA 98103 \\
  E-mail: \email{whitworth.alex@gmail.com}
}


%------------------------------------------------------------
\begin{document}

\section{Introduction}

As \cite{amelia} noted, ``missing data is a ubiquitous problem [especially] in social science data.
Respondents do not answer every question, countries do not collect statistics every year, archives are 
incomplete, [and] subjects drop out of panels. Most statistical analysis methods, however, assume the absence
of missing data, and are only able to include observations for which every variable is measured." 
In the past several decades, widely available methods have been developed for missing value imputation 
allowing practitioners to move beyond ad-hoc methods such as casewise-deletion and mean imputation, to 
more advanced methods such as multiple imputation \cite{rubin}, expectation maximization \cite{dempster}, 
chained equations \cite{mice}, and data augmentation \cite{data_aug}. However, many of 
these methods assume the data comes from a multivariate normal distribution, which ignores missing data from 
a variety of other distributions.

\pkg{imputeMulti} performs missing data imputation for the common 
multivariate multinomial distribution. Like other imputation methods, this method creates a ``filled in" 
version of the incomplete data so that analyses which require complete observations 
can appropriately use all the data in a dataset containing missingness. \pkg{imputeMulti} uses the familiar 
expectation-maximization (EM) and data augmentation (DA) algorithms to estimate the parameter 
estimates of the multinomial distribution and maximum likelihood to impute missing observations.

Two issues which make multivariate multinomial missing data imputation challenging are the 
memory required to store all possible combinations of the variables of interest and the computation
time required to calculate both the complete data sufficient statistics and the missing data marginal
sufficient statistics. To overcome these performance challenges, \pkg{imputeMulti} uses \proglang{C++} 
via \pkg{Rcpp} \cite{rcpp} and also allows for parallel processing 
via \pkg{parallel} \cite{r_core}.

\pkg{imputeMulti} provides advantages over other popular imputation methods such as \pkg{Amelia} \cite{amelia}, \pkg{mice} \cite{mice}, and \pkg{Hmisc} \cite{Hmisc} when working with multinomial data.
The most important advantage is higher imputaton accuracy for multinomial data relative to any of \pkg{Amelia}, 
\pkg{mice}, and \pkg{Hmisc}. Secondly, \pkg{imputeMulti} provides researchers easy access to the 
multinomial parameter estimates. In many cases, the parameter estimates may be of more interest to researchers
than the observation level imputations. One disadvantage of \pkg{imputeMulti} is longer run time
than \pkg{Amelia} or \pkg{Hmisc}, although \pkg{imputeMulti} is faster than \pkg{mice}. Run time
is impacted by unique aspects of the the multinomial distribution which will be discussed below.

The remainder of the paper proceeds as follows: Section~\ref{sec:multi} provides a brief discussion of multinomial
missing data problems; Section~\ref{sec:guide} provides a user's guide to \pkg{imputeMulti}; 
Section~\ref{sec:compare} compares \pkg{imputeMulti} with alternative imputation methods; and 
Section~\ref{sec:conclude} concludes.

%------------------------------------------------------------
\section{Multinomial missing data} \label{sec:multi}
%------------------------------------------------------------

\pkg{imputeMulti} is an implementation of the imputation methods for multivariate multinomial missing 
data found in \cite[Chapter~7]{schafer}. An abbreviated discussion of multivariate multinomial missing 
data is provided below. To facilitate use of this reference text, the same notation as \cite{schafer} 
is used throughout this discussion. 

To begin, let $Y_1, Y_2, \ldots, Y_p$ be categorical variables each taking a finite number of values  
$Y_j \in \{1,2,\ldots, d_j \}; j= 1,2,\ldots, p$. If the observations are independent and identically 
distributed (iid), then we can reduce $Y$ to a contingency table with $D$ cells, where 
$D= \prod_{j=1}^p d_j$ is the number of distinct combinations of the levels of $Y_1, Y_2, \ldots, Y_p$.
The cell counts of $D$ are denoted by $x_d, d = 1,\ldots, D$. If the 
sample size is \textit{n}, then \textit{x} has a multinomial distribution: 

\begin{align}
  x|\theta &\sim M(n,\theta)
\end{align}

with parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_D )$. The likelihood function 
for the multinomial parameter $\theta$ is 

\begin{align}
  L(\theta|Y) &\propto \prod_{d=1}^D \theta_{d}^{x_d} I_{\Theta(\theta)}
\end{align}

where $I_{\Theta(\theta)}$ is an indicator function equal to 1 if $\theta \in \Theta$ and 0 otherwise. 
This leads to the well known maximum likelihood estimates (MLE): 

\begin{align}
  \hat \theta_d = \frac{x_d}{n}, d= 1,\ldots, D
\end{align}

\subsection{The Bayesian case and the Dirichlet prior}

The simplest way to conduct Bayesian inference on a multinomial model is to assume a Dirichlet prior 
\cite{dirichlet}, 
which is typically denoted by the parameter vector $\alpha= (\alpha_1, \alpha_2, \ldots, \alpha_D)$.
The prior and posterior of $\theta$ with a Dirichlet prior are thus written in shorthand as 

\begin{align}
  \theta|\alpha &\sim D(\alpha) \\
  \theta|Y &\sim D(\alpha')
\end{align}

where $\alpha' = (\alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_D + x_D)$. The posterior mean is

\begin{align}
  \E(\theta|Y) &= \left(\frac{\alpha_1'}{\alpha_0'}, \frac{\alpha_2'}{\alpha_0'} \ldots, 
    \frac{\alpha_D'}{\alpha_0'} \right)
\end{align}

with $\alpha_0' = \sum_{d=1}^D (\alpha_d + x_d) = \alpha_0 + n$.

As is typical in Bayesian analyses, the choice of prior can have important effects on the outcome of 
missing value imputation for multinomial data. In addition to no prior, \pkg{imputeMulti} supports  
three choices of prior. When little prior information is known, a noninformative prior
may be a sensible choice. If the sample size is large relative to the number of parameters being estimated, 
a noninformative prior will have little impact on model inferences. The choice used in \pkg{imputeMulti}
is twice the Jefferys prior, $c=1$, which provides proper posterior modes of $\theta$.\footnote{Using the 
Jefferys prior, $c=\frac{1}{2}$, can produce improper posterior modes. Specifically, if $\alpha_y < 1$ 
and the corresponding count $x_y$ is zero, then $\hat \theta_y$ will be negative.} Another choice of 
prior is a flattening prior $\alpha= (c,c,\ldots,c)$ for some $c>1$. A final 
choice available in \pkg{imputeMulti} is the data-dependent prior, whose aim is to take a \textit{peek} 
at the data, via some summary statistics for instance, prior to analysis. Discussion and criticism 
of the use of data-dependent priors is beyond the scope of this article. Interested readers are 
referred to \cite{darnieder}.

\subsection{Characterizing the multivariate multinomial missing data problem}

At the core of all imputation methods is that we do not completely observe the data $x$, but instead only 
observe part of it. In the multinomial case, assume that we have grouped the observed data into their 
observed patterns $x^{obs}$ and their missingness patterns $x^{mis}$. Index the missingness paterns 
by $s= 1,2,\ldots,S$ and define a set of indicator variables:

\begin{align*}
  r_{sj} &= \left\{ \begin{array}{lc} 
    1 & \mbox{if } Y_j \mbox{ is observed in } s \\
    0 & \mbox{if } Y_j \mbox{ is missing in } s \end{array} \right.
\end{align*}

Let $O_s(y)$ and $M_s(y)$ respectively denote the sets of observed and missing variables within each
missingness pattern and with elements defined

\begin{align}
  O_s(y) &= \{y_j : r_{sj} = 1 \} \\
  M_s(y) &= \{y_j : r_{sj} = 0 \}.
\end{align}

Since most missing observations are \textit{partially} observed, within each missingness pattern $s$,
observations are cross-classified by their observed variables. The distinct partially observed patterns
are denoted $x^{(s)}$; and their counts tabulated into a table denoted

\begin{align}
  z_{O_s(y)}^{s} &= \sum_{M_s(y)\in M_s} x^{(s)} \mbox{ for all } O_s(y) \in O_s.
\end{align}

The marginal probability that an observation falls within a given cell of this table of partially 
observed values is denoted

\begin{align}
  \beta_{O_s(y)} = \sum_{M_s(y)\in M_s} \theta_{y.} \mbox{  .}
\end{align}

And the observed-data loglikelihood contribution from the partially observed data is

\begin{align}
  l(\theta|Y_{obs}) &= \sum_{s=1}^S \sum_{O_s(y)\in O_s} z_{O_s(y)}^{s} log\left(\beta_{O_s(y)} \right) .
\end{align}

The EM algorithm for maximizing the complete-data loglikelihood is straightforward, with the multivariate
case first described by \cite{fuchs}. For the E-step, we find the expected summary statistics 
given the observed data and the current value of theta, 
\begin{align}
  \E(x_y|Y_{obs}, \theta) &= \sum_{s=1}^S z_{O_s(y)}^{s} \theta_y / \beta_{O_s(y)} .
  \label{E-step}
\end{align}
This leads to the trivial M-step. Under maximum-likelihood, set $\hat \theta = \E(x_y|Y_{obs}, \theta) / n$ 
for all $y \in Y$. A minor modification is made to maximize the complete-data posterior density under 
a Dirichlet prior $\hat \theta_y = (x_y +\alpha_y - 1) / (n + \alpha_0 - D)$ for all $y \in Y$ where
$\alpha_0=\sum_i^D \alpha_i$ and $D$ is the number of parameters. Similarly, different minor modifications 
can be made to convert the EM algorithm to data-augmentation. For DA, instead of proportionally allocating the 
counts of partially missing observations to fully observed $x^{obs}$ counts based on the current value
of $\theta$ as in Equation~\ref{E-step}, the proportional allocation is replaced by a random allocation based on the 
current value of $\theta$.

%------------------------------------------------------------
\section{Software user's guide} \label{sec:guide}
%------------------------------------------------------------

We now turns to the practical matter of using \pkg{imputeMulti}, which is freely available as a package 
for the statistical software \proglang{R} \cite{r_core} and can be run in any environment that 
\proglang{R} can. \proglang{R} is freely available from \url{https://www.r-project.org/}. To install 
\pkg{imputeMulti} from a CRAN (Comprehensive R Archive Network) mirror, simply type the following 
command in the \proglang{R} command prompt,

\begin{Code}
  R> install.packages("imputeMulti").
\end{Code}

If you wish to use the most current development version, you can install it from Github. This is most
easily done using the \pkg{devtools} package \cite{devtools}, via

\begin{Code}
  R> devtools::install_github("alexWhitworth/imputeMulti").
\end{Code}

To keep \pkg{imputeMulti} up to date, you should use the \proglang{R} command 
\code{update.packages()}.

\subsection{Example data}

To illustrate \pkg{imputeMulti}, we use simulated (ACS) American Community Survey data for 
individuals living in census tract 2221 in Los Angeles County, California. The data was simulated
using spatial microsimulation \cite{orcutt}. This dataset contains ten variables on 3,974 individuals.
Missing values have been inserted, independently and at random, to seven of the ten variables.
A detailed description of the dataset can be found by typing \code{?tract2221} into the \proglang{R} console. 

\begin{Code}
  R> library("imputeMulti")
  R> data("tract2221")
\end{Code}

Beyond being useful to illustrate the use of \pkg{imputeMulti}, the data also serves to illustrate
one constraint on multivariate multinomial missing data imputation---combinatorial explosion. 
If we choose to use all ten variables in the model, the number of distinct 
combinations which may be observed is extremely high. Using the previously introduced notation, 
$D= \prod_{j=1}^p d_j = 8,467,200$. Given that seven variables have missing values, 
the number of possible distinct combinations of marginally observed variables is even higher at 38,707,200. 
In \proglang{R}, an integer vector of this size requires 155 MB of memory and a corresponding \code{data.frame}, 
which is needed to store $z_{O_s(y)}^{s}$, requires 1.7 GB of memory. If our model were substantially 
larger, we would quickly run into \pkg{base} \proglang{R}'s constraint on matrix size of 2 billion elements. 

One option to alleviate the memory constraint is the use of a package that stores objects locally, such 
as \pkg{bigmemory} \cite{bigmemory}. But alleviating memory concerns does not impact the larger problem 
of factorial runtime---O(p!)---factorial in the number of variables $p$. One approach is the use of a multivariate normal 
distribution to provide an approximate solution. The approach suggested in this guide is to instead 
find an exact solution to an approximate problem. That is, to decompose the set of all variables into 
subsets of related variables; and to find an exact solution for each subset. 

Run-time constraints are not the only benefit of decomposing an increasingly large set of variables. An additional benefit is that complex, higher-order, interactions may be poorly 
estimated by the fully saturated multinomial model. In these situations, it is often useful to
simplify the model by selectively removing some of the highest order associations.

In this dataset, for example, the researcher might decide that income, poverty and employment statuses,
educational attainment, age, and gender are closely related and thus 
group \code{age}, \code{gender}, \code{edu\_attain}, \code{pov\_status}, \code{emp\_status}, 
and \code{ind\_income} into one subset. Further, she might consider marital status, race, nativity, 
and geographic mobility to all be closely related. This would lead to a second subset of 
\code{marital\_status}, \code{nativity}, \code{geog\_mobility}, and \code{race}. This is the approach 
that we uses here to illustrate the use of \pkg{imputeMulti}. 

\subsection{Imputation via expectation-maximization}

The primary user-facing function for \pkg{imputeMulti} is \code{multinomial\_impute},
which provides observation level imputation for multinomial data. Various options can be specified 
to perform multinomial imputations using EM or DA along with the specification of one 
of four possible priors: \code{`none'}, \code{`non.informative'}, 
\code{`flat.prior'}, or \code{`data.dep'}. These options for imputing 
missing values via EM are now illustrated, starting without the use of a Bayesian prior.

\begin{Sinput}
  R> set.seed(2134L)
  R> imputeEM_none <- multinomial_impute(tract2221[,c(1:2,4,7,9)], 
       method = "EM", conj_prior = "none", verbose = TRUE)
\end{Sinput}

Here, \code{method = "EM"} is specified for the expectation-maximization algorithm and
\code{conj\_prior = "none"} for a strictly maximum likelihood estimate. The parameter
\code{verbose = TRUE} is specified to provide informative intermediate outputs. The option
\code{parallel = TRUE} can also be specified to compute the sufficient statistics in
parallel. In practice, this only provides a performance improvement for exceptionally large datasets 
and thus the default is currrently \code{parallel = FALSE}.

\begin{Code}                          
  [1] "Calculating observed data sufficient statistics."
  [1] "Setting up Iteration 1."
  Iteration 1 : log-likelihood = -32083.8393739337 
   Convergence Criteria = 0.0160612788 ... 
  Iteration 2 : log-likelihood = -24079.2963400782 
   Convergence Criteria = 0.0043943837 ...
  .... output truncated .... 
  Iteration 119 : log-likelihood = -23650.2292737947 
   Convergence Criteria = 0.0000004931 ...
  [1] "Imputing missing observations via MLE results."
\end{Code}

Typical of EM, convergence is quite rapid. By changing the argument 
\code{conj\_prior}, Bayesian priors may be specified. The other options for 
\code{conj\_prior} are shown below % NEED TO ADD: custom priors -- fully specify arugment alpha

\begin{Code}
  R> imputeEM_non <- multinomial_impute(tract2221[,c(1:2,4,7,9)], 
       method = "EM", conj_prior = "non.informative", verbose = TRUE)
  R> imputeEM_flat <- multinomial_impute(tract2221[,c(1:2,4,7,9)], 
       method = "EM", conj_prior = "flat.prior", alpha = 10L, verbose = TRUE)
  R> imputeEM_data <- multinomial_impute(tract2221[,c(1:2,4,7,9)], 
       method = "EM", conj_prior = "data.dep", verbose = TRUE)
\end{Code}

where a flat prior may be specified as a scalar by the parameter \code{alpha}. Note the use of a 
strong flat prior in this example. 

In addition to observation level imputation, some users may only be interested in the parameter estimates. 
\pkg{imputeMulti} allows for this type of analysis as well. To do so, first compute the observed and 
marginally-observed summary statistics via \code{multinomial\_stats} and then estimate the parameters 
directly with your choice of prior via \code{multinomial\_em}.  An example of this is shown 
below.\footnote{The differences in log-likelihood and convergence criteria in the displayed output of 
\code{multinomial\_impute} and \code{multinomial\_em} are due to 
random seeding of initial parameters. These random differences in initial parameter 
values lead to differences of only (<0.003) in final log-likelihood.}

\begin{CodeChunk}
\begin{CodeInput}
  R> x_y <- multinomial_stats(tract2221[,c(1:2,4,7,9)], output = "x_y")
  R> z_Os_y <- multinomial_stats(tract2221[,c(1:2,4,7,9)], output = "z_Os_y")
  R> x_possible <- multinomial_stats(tract2221[,c(1:2,4,7,9)], 
       output = "possible.obs")
  
  R> imputeEM_mle <- multinomial_em(x_y, z_Os_y, x_possible, 
       n_obs = nrow(tract2221), conj_prior = "none", verbose = TRUE)
\end{CodeInput}
\begin{CodeOutput}
  [1] "Setting up Iteration 1."
  Iteration 1 : log-likelihood = -32096.9776607773 
   Convergence Criteria = 0.0157098212 ... 
  Iteration 2 : log-likelihood = -24086.0109006056 
   Convergence Criteria = 0.0030262731 ...  
   .... output truncated ....
  Iteration 101 : log-likelihood = -23650.2320024124 
   Convergence Criteria = 0.0000004888 ... 
\end{CodeOutput}
\end{CodeChunk}

The outputs of these functions will be examined in Section~\ref{sec:outputs}. 

\subsection{Imputation via data-augmentation} \label{sec:DA}

Using \pkg{imputeMulti} for DA is very similar to the use for expectation-maximization. The chief 
functional difference is using \code{method = "DA"} instead of \code{method = "EM"}. Imputation of 
observation level data is still done via \code{multinomial\_impute}; and, the choice of prior can be 
controlled by \code{conj_prior}. An example is shown below using the second subset of variables discussed
at the beginning of this Section.

\begin{CodeChunk}
\begin{CodeInput}
  R> imputeDA_none <- multinomial_impute(tract2221[,c(3,6,9:10)], 
       method = "DA", conj_prior = "none", verbose = TRUE, burnin = 100)
\end{CodeInput}
\begin{CodeOutput}                          
  [1] "Calculating observed data sufficient statistics."
  [1] "Setting up Iteration 1."
  Iteration 1 : log-likelihood = -26927.4017334082 ... 
  Iteration 2 : log-likelihood = -18571.4267260919 ... 
   .... output truncated .... 
  Iteration 99 : log-likelihood = -18369.0208750788 ... 
  Iteration 100 : log-likelihood = -18381.5414370566 ...  
  [1] "Imputing missing observations via MLE results."
\end{CodeOutput}
\end{CodeChunk}

Parameter estimates via DA can also be easily obtained via \code{multinomial\_data\_aug}
in a similar fashion to EM. 

\begin{Code}
  R> imputeDA_mle <- multinomial_data_aug(x_y, z_Os_y, x_possible,
       conj_prior = "none", verbose = TRUE,
       burnin = 100, post_draws = 1000)
\end{Code}

With DA, the number of burnin samples is controled via \code{burnin} for both 
\code{multinomial\_impute} and \code{multinomial\_data\_aug}. \code{burnin} 
defaults to 100. Parameter estimates 
are calculated as the posterior mean based on \code{post\_draws} draws, which can again 
be specified for both functions. The default is 1000 draws. 

\subsection{Examining imputed outputs} \label{sec:outputs}

\pkg{imputeMulti} uses S4 classes. Both \code{multinomial\_em} and \code{multinomial\_data\_aug} 
return \code{mod\_imputeMulti-class} objects while \code{multinomial\_impute} returns \code{imputeMulti-class} 
objects. The \code{imputeMulti-class} class inherits from the \code{mod\_imputeMulti-class}. Several
methods are available for summarization and extracting elements from the class objects. However, most 
users will be interested in only two methods of the \code{imputeMulti-class} object: \code{get\_parameters} 
which extracts the parameter estimates, and \code{get\_imputations} which extracts the imputed observation level 
data.\footnote{Other methods for these classes can be found in the documentation. See 
\code{?'mod\_imputeMulti-class'} and \code{?'imputeMulti-class'}.} 

\begin{Code}
  R> param_est    <- get_parameters(imputeEM_none)
  R> imputed_data <- get_imputations(imputeEM_none)
\end{Code}

Since neither \code{multinomial\_em} nor \code{multinomial\_data\_aug} impute at the observation level, 
the \code{get\_imputations} method does not exist for \code{mod\_imputeMulti-class} objects. 

To recombine imputed data with the original dataset, utilize the fact that \pkg{imputeMulti} maintains
both row order and rownames. Recombining imputed and original data can therefore be done via a simple 
replacement or by merging via rownames. The latter method is implemented in \pkg{imputeMulti} and both
methods are shown below:

\begin{Code} 
  R> tract_imp <- data.frame(imputed_data, tract2221[, c(3,6,9:10)])
  R> tract_imp <- merge_imputed(imputeEM_none, tract2221[,c(3,6,9:10)])
\end{Code}

Researchers can also examine the parameter estimates directly. For example, researchers may be 
interested in the marginal distribution of $\hat \theta$ by gender 
and educational attainment among those aged eighteen to thirty-four compared to those aged fifty to 
sixty-four. Here a comparison is shown for a \code{"non.informative"} and \code{"flat.prior"}, which
also illustrates the impact of the previously chosen strong flat prior.

\begin{Code}
  R> param_est_non <- get_parameters(imputeEM_non)
  R> param_est_flat <- get_parameters(imputeEM_flat)
\end{Code}

We first extract the marginal statistics and then normalize $\hat \theta$ within each age group. 
The impact of the choice of prior on $\hat \theta$ is then shown in Tables \ref{tbl:marg_param_est_noprior}
and \ref{tbl:marg_param_est_flatprior}. 

% non-informative prior
\begin{Code}
  R> pe_non18 <- param_est_non[param_est_non$age %in% 
       c("18_24", "25_29", "30_34"),]
  R> pe_non50 <- param_est_non[param_est_non$age %in% 
       c("50_54", "55_59", "60_64"),]
  R> non_theta18_34 <- sum(pe_non18$theta_y)
  R> non_theta50_64 <- sum(pe_non50$theta_y)
  R> marg_non18 <- tapply(pe_non18$theta_y, 
       list(pe_non18$gender, pe_non18$edu_attain), sum)
  R> marg_non50 <- tapply(pe_non50$theta_y, 
       list(pe_non50$gender, pe_non50$edu_attain), sum)
  R> round(marg_non18  / non_theta18_34, 4)
  R> round(marg_non50  / non_theta50_64, 4)
\end{Code}

% strong flat prior
\begin{Code}
  R> pe_flat18 <- param_est_flat[param_est_flat$age %in% 
       c("18_24", "25_29", "30_34"),]
  R> pe_flat50 <- param_est_flat[param_est_flat$age %in% 
       c("50_54", "55_59", "60_64"),]
  R> flat_theta18_34 <- sum(pe_flat18$theta_y)
  R> flat_theta50_64 <- sum(pe_flat50$theta_y)
  R> marg_flat18 <- tapply(pe_flat18$theta_y, 
       list(pe_flat18$gender, pe_flat18$edu_attain), sum)
  R> marg_flat50 <- tapply(pe_flat50$theta_y, 
       list(pe_flat50$gender, pe_flat50$edu_attain), sum)
  R> round(marg_flat18 / flat_theta18_34, 4)
  R> round(marg_flat50 / flat_theta50_64, 4)
\end{Code}


\FloatBarrier
\begin{table}
\centering
\begin{tabular}[h]{l|ccccccc} \hline 
  \textbf{Ages 18-34} & \code{lt_hs} & \code{some_hs} & \code{hs_grad} & \code{some_col} & 
    \code{assoc_dec} & \code{ba_deg} & \code{grad_deg} \\
  \hline
  Female & 0.0507 & 0.0421 & 0.1523 & 0.1251 & 0.0000 & 0.0912 & 0.0205 \\
  Male   & 0.0423 & 0.0928 & 0.0922 & 0.1141 & 0.0187 & 0.1197 & 0.0383 \\
  \hline
  \textbf{Ages 50-64} & & & & & & & \\
  \hline
  Female & 0.1260 & 0.0866 & 0.1304 & 0.1051 & 0.0405 & 0.0000 & 0.0318 \\
  Male   & 0.1704 & 0.0980 & 0.0357 & 0.0471 & 0.0575 & 0.0710 & 0.0000 \\
  \hline
\hline \end{tabular}
\caption{Estimates of the marginal distribution of $\hat \theta$ by gender and educational attainment
using a non-informative prior. Estimates are for individuals aged 18-34 and 50-64.}
\label{tbl:marg_param_est_noprior}
\end{table}

\begin{table}
\centering
\begin{tabular}[h]{l|ccccccc} \hline 
  \textbf{Ages 18-34} & \code{lt_hs} & \code{some_hs} & \code{hs_grad} & \code{some_col} & 
    \code{assoc_dec} & \code{ba_deg} & \code{grad_deg} \\
  \hline
  Female & 0.0684 & 0.0671 & 0.0831 & 0.0791 & 0.0611 & 0.0744 & 0.0641 \\
  Male   & 0.0671 & 0.0744 & 0.0744 & 0.0776 & 0.0641 & 0.0784 & 0.0668 \\
  \hline
  \textbf{Ages 50-64} & & & & & & & \\
  \hline
  Female & 0.0763 & 0.0729 & 0.0764 & 0.0744 & 0.0687 & 0.0653 & 0.0680 \\
  Male   & 0.0801 & 0.0738 & 0.0652 & 0.0694 & 0.0702 & 0.0713 & 0.0652 \\
  \hline
\hline \end{tabular}
\caption{Estimates of the marginal distribution of $\hat \theta$ by gender and educational attainment
using a strong flat prior. Estimates are for individuals aged 18-34 and 50-64.}
\label{tbl:marg_param_est_flatprior}
\end{table}

The marginalized estimates of $\hat \theta$ with a non-informative prior show that education levels
have improved over time as the younger cohort has higher estimated parameters for
`some college' or higher educational attainment. It is also clear that there is greater gender equality
in educational outcomes for the younger cohort. But the analyst would not draw these conclusions
using a strong flat prior. As expected, a strong flat prior 
exerts substantial smoothing effects on the estimates of $\hat \theta$. While there is a large range
in parameter estimates for the non-informative prior, all estimates for the flat prior have been 
smoothed to near 0.07. This comparison shows that, as with most Bayesian analyses, the results of 
imputation of multinomial data can be quite sensitive to the choice of prior.

%------------------------------------------------------------
\section{Comparing imputeMulti} \label{sec:compare}
%------------------------------------------------------------

In this section, we compare \pkg{imputeMulti} with other popular imputation methods. Specifically,
we examine bootstrap EM via \pkg{Amelia} \cite{amelia}, polytomous logistic regression 
via \pkg{mice} \cite{mice}, and predictive mean matching via \pkg{Hmisc} \cite{Hmisc}. The goal
of the comparison is not to provide an \textit{exhaustive} comparison of imputation methods; but to
compare commonly used methods within each package. The \code{tract2221} dataset is again used for
illustration, in this case looking at imputation of five variables: \code{age}, \code{gender}, 
\code{marital_status}, \code{edu_attain}, and \code{emp_status}.

The first thing to note is that \pkg{Amelia} does not allow for categorical variables with greater than 
ten levels, while \code{tract2221$age} has sixteen levels.

\begin{CodeChunk}
\begin{CodeInput}
  R> library("Amelia")
  R> amelia_EM <- amelia(tract2221[,1:5], m = 1, noms = 1:5) 
\end{CodeInput}
\begin{CodeOutput}
  Warning message:
  In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors,  : 

  The number of categories in one of the variables marked nominal has greater 
  than 10 categories. 
  Check nominal specification.
\end{CodeOutput}
\end{CodeChunk}

This is one serious limitation on multivariate multinomial imputation when using a package designed 
for a multivariate normal distribution. Researchers using \pkg{Amelia} must immediately decide the 
best way to limit some of the richness of their data. None of the other three packages have this
restriction. To keep the remainder of the comparison as equivalent as possible, only four variables
are used: \code{gender}, \code{marital_status}, \code{edu_attain}, and \code{emp_status}.

\subsection{Comparing algorithm speed}

The \pkg{microbenchmark} package \cite{microbenchmark} is used for speed comparisons. Tests were 
performed on a Intel Xeon CPU E5-2960 v3 2.60GHz server running Windows Server 2012 R2 Standard and were
not run in parallel. To purely test algorithm speed, multiple imputations are not specified for any
packages that allow them.

\begin{CodeChunk}
\begin{CodeInput}
  R> library("Amelia")
  R> library("Hmisc")
  R> library("imputeMulti")
  R> library("mice")
  R> library("microbenchmark")
  R> test_d <- tract2221[,2:5]
  R> microbenchmark(
    EM = multinomial_impute(test_d, "EM", conj_prior = "non.informative"),
    DA = multinomial_impute(test_d, "DA", conj_prior = "non.informative"),
    amelia = amelia(test_d, m = 1, noms = 1:4),
    hmisc = aregImpute(~ gender + marital_status + edu_attain + emp_status, 
                    data = test_d, n.impute = 1, type = "pmm"),
    mice = mice(test_d, m = 1, method = "polyreg"), times = 10L)
\end{CodeInput}
\begin{CodeOutput}
  Unit: milliseconds
  expr       min        lq      mean    median        uq       max    neval
  EM      2271.97  2279.53  2336.20  2282.18  2418.99  2532.10    10
  DA      3664.21  3843.10  4328.98  4242.36  4948.12  4979.40    10
  amelia  144.29   149.73    175.99  151.40    153.20   402.19    10
  hmisc   682.13   688.32    822.01  819.47    945.75   997.42    10
  mice    3899.12 3913.96   4091.42 3956.81   4161.81  4995.85    10
\end{CodeOutput}
\end{CodeChunk}

All algorithms finish in a matter of milliseconds or seconds. \pkg{Amelia} is clearly the fastest 
algorithm, beating \pkg{imputeMulti} by more than an order of magnitude in this test. 
As expected, EM, which stops upon convergence, is noticeably faster than DA, which must run all 
\code{burnin} iterations before calculating posterior means. 

\subsection{Comparing outputs: Parameter estimates} \label{sec:param_compare}

Differences in the outputs of the various methods are examined next, focusing on differences 
in output rather than the the statistical properties of
each package. For a discussion of using the multivariate normal to approximate multinomial data, 
see \cite[Chapter~6]{schafer}. For the purposes of this article, a comparison of a single commonly 
used method within each package is used rather than an exhaustive comparison. As above, 
multiple imputations are not specified for any packages that allow them. Functions used for these comparisons
as well as post-processing the \pkg{Hmisc} and \pkg{mice} outputs (Section~\ref{sec:obs_compare}) 
are provided in the supplementary material to this article.

\begin{Code}
  R> source("./00_JSS_suplemental_functions.R")
  R> set.seed(1987543L)
  R> test_dat1 <- tract2221[complete.cases(tract2221[,2:5]), 2:5]
  R> test_dat2 <- createNAs(test_dat1, pctNA = 0.15) 
  R> rownames(test_dat2) <- 1:nrow(test_dat2)

  R> IM_EM <- multinomial_impute(test_dat2, "EM", 
       conj_prior = "non.informative", verbose = TRUE)
  R> amelia_EM <- amelia(test_dat2, m = 1, noms = c("gender", 
       "marital_status", "edu_attain", "emp_status"))
  R> hmisc_pmm <- aregImpute(~ gender + marital_status + edu_attain + 
       emp_status, data = test_dat2, n.impute = 1, type = "pmm")
  R> mice_ch_eq <- mice(test_dat2, m = 1, method = "polyreg")
\end{Code}

\pkg{Amelia}, \pkg{mice}, and \pkg{Hmisc} all use S3 classes for outputs. In both \pkg{mice} and \pkg{Hmisc}, 
the parameter estimates are not directly available. The parameter estimates from the \code{amelia} function 
are found in \code{amelia_EM$mu} and \code{amelia_EM$theta}. This output structure fits the expectation
that the data comes from a multivariate normal (eg. Y $\sim N(\mu, \Sigma)$). The output of 
\code{amelia_EM$mu} is shown below.

\begin{Code}
  noms.gender.2         -0.034183160
  noms.marital_status.2 -0.001370118
  noms.marital_status.3  0.001611829
  noms.marital_status.4 -0.005067206
  noms.marital_status.5  0.040268816
  noms.edu_attain.2      0.014664753
  noms.edu_attain.3      0.011949303
  noms.edu_attain.4      0.020386341
  noms.edu_attain.5     -0.024465460
  noms.edu_attain.6      0.015697896
  noms.edu_attain.7      0.003025067
  noms.emp_status.2     -0.022108630
  noms.emp_status.3      0.012268149
\end{Code}

As is clear from the above output, when working with multivariate multinomial data using the 
\pkg{Amelia} package, researchers are 
required to post-process parameter outputs to fit their needs. It is not immediately obvious 
how to get specific multinomial parameter estimates or how to marginalize these estimates, for example 
obtaining the marginal distribution of $\hat \theta$ by gender and educational attainment as 
in Section~\ref{sec:outputs}. An additional advantage to using \pkg{imputeMulti} is therefore 
improved ease of use for researchers interested in multinomial parameter estimates.

\FloatBarrier
\subsection{Comparing outputs: Observation level imputations} \label{sec:obs_compare}

Observation level imputations are provided by \pkg{Amelia}, \pkg{mice}, and \pkg{Hmisc}. In \pkg{Amelia}
and in \pkg{imputeMulti}, imputed observations are returned in the original dataset and can be compared 
directly. For both \pkg{mice} and \pkg{Hmisc} only the missing observations and their row numbers are 
returned. The imputations thus require post-processing to insert the imputed values into the original 
dataset containing missing values. 

To test accuracy of observation level imputations, two tests are run. The first test does not 
allow  for multiple imputations, while the second does. To setup the tests, the \code{complete.cases}
from \code{tract2221[,2:5]} are extracted and missing values are randomly inserted. Observation
level imputation is then conducted for these datasets and the imputations were compared to the original,
fully observed, dataset. To test sensitivity to the amount of missingness in the data,
tests are run using 15\%, 30\%, 45\%, and 60\% missing values. 

The first test compares accuracy of \pkg{imputeMulti} to a 
single imputation in the other methods. The second test allows ten multiple imputations and compares
the accuracy of a single \pkg{imputeMulti} imputation to the maximum accuracy across the ten multiple
imputations using the other methods. Ten such tests are run for each level of 
inserted missingness and the mean results are shown in Table~\ref{tbl:impute_accuracy_test1} and 
Table~\ref{tbl:impute_accuracy_test2}. 

Unsurprisingly, imputation accuracy degrades for each method as missingness increases. But, in each test, 
\pkg{imputeMulti} has the highest accuracy for completely matching true observations; and, \pkg{imputeMulti}
maintains this advantage even against multiple imputations. The closest comparison method was \pkg{mice}. 

\begin{table}
\centering
\begin{tabular}[h]{l|ccccc} \hline 
  Percent Missing & IM-EM &  IM-DA & Amelia & Hmisc & mice \\
  \hline
  15\% & 0.7642 & 0.7615 & 0.6761 & 0.7077 & 0.7133 \\
  30\% & 0.5700 & 0.5639 & 0.4404 & 0.4833 & 0.4882 \\
  45\% & 0.4135 & 0.4077 & 0.2679 & 0.3166 & 0.3188 \\
  60\% & 0.2827 & 0.2794 & 0.1460 & 0.1908 & 0.1993 \\
\hline \end{tabular}
\caption{Mean accuracy of observation level imputation for \pkg{imputeMulti}, \pkg{Amelia}, 
  \pkg{Hmisc}, and \pkg{mice} using a single imputation.}
\label{tbl:impute_accuracy_test1}
\end{table}

\begin{table}
\centering
\begin{tabular}[h]{l|ccccc} \hline 
  Percent Missing & IM-EM &  IM-DA & Amelia & Hmisc & mice \\
  \hline
  15\% & 0.7651 & 0.7615 & 0.6857 & 0.7175 & 0.7166 \\
  30\% & 0.5687 & 0.5645 & 0.4493 & 0.4968 & 0.4966 \\
  45\% & 0.4097 & 0.4070 & 0.2816 & 0.3233 & 0.3298 \\
  60\% & 0.2853 & 0.2857 & 0.1653 & 0.1994 & 0.2052 \\
\hline \end{tabular}
\caption{Mean accuracy of observation level imputation for \pkg{imputeMulti}, \pkg{Amelia}, 
  \pkg{Hmisc}, and \pkg{mice}. A single imputation from \pkg{imputeMulti} is compared to the 
  maximum accuracy from ten multiple imputations of the comparison methods.}
\label{tbl:impute_accuracy_test2}
\end{table}

%------------------------------------------------------------
\section{Conclusion} \label{sec:conclude}
%------------------------------------------------------------

Deciding how to deal with missing values is a frequent concern for applied researchers. Although 
many methods exist for missing value imputation, the vast majority rely on
assumptions of multivariate normality. It is often desirable to use a model 
specifically designed for the distribution from which a researcher's data is generated. \pkg{imputeMulti}
provides an easily accessible model for imputing multivariate multinomial data. Further, \pkg{imputeMulti} 
integrates easily into common analyses and handles large datasets with high performance by providing 
functions for calculating sufficient statistics in \proglang{C++}. \pkg{imputeMulti}
also allows parallel processing in the case of extremely large datasets.

This article provides a hands-on introduction to the \pkg{imputeMulti} package as well as comparison
to existing methods in \proglang{R} for multivariate multinomial imputation. As shown, the use of
a package specifically designed for multivariate multinomial imputation, such as \pkg{imputeMulti},
provides exact solutions for datasets with small numbers of variables; and, for datasets with large
numbers of variables, \pkg{imputeMulti} may be used by splitting the variables into related groups.
In addition, \pkg{imputeMulti} produces higher observation level imputation accuracy as well as 
providing easier access to parameter estimates. 

Future development of \pkg{imputeMulti} is expected to focus on extending performance by migrating more
of the code to \proglang{C++}, improving the search algorithms used in calculating the 
multinomial distribution sufficient statistics by implementing balanced-trees, and exploring further 
parallelization of parameter estimates. As with any parallelization effort, the tradeoff between 
parallel overhead and increased utilization of multi-core computing resources must be intelligently 
balanced. To date, timing tests on Windows machines, which do not allow forking, have indicated that 
the parallel overhead is only justified for datasets with several hundred thousand or more observations.


%------------------------------------------------------------
% Bibliography
% \clearpage
\bibliography{imputeMulti}
\nocite{*}
%------------------------------------------------------------
\end{document}