---
title: "Introduction to `ungroup`"
author: "Marius D. Pascariu, Maciej J. Dańko, Jonas Schöley and Silvia Rizzi"
date: "September 1, 2018"
output:
  pdf_document:
    highlight: tango
    number_sections: yes
    toc: no
fontsize: 11pt
geometry: margin=1.1in
bibliography: REFERENCES.bib
biblio-style: apalike
link-citations: true
vignette: >
  %\VignetteIndexEntry{Introducing ungroup} 
  %\VignetteEngine{knitr::rmarkdown} 
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(collapse = TRUE)
```

# Abstract

The `ungroup` R package introduces a versatile method for ungrouping histograms (binned count data) assuming that counts are Poisson distributed and that the underlying sequence on a fine grid to be estimated is smooth. The method is based on the composite link model and estimation is achieved by maximizing a penalized likelihood. Smooth detailed sequences of counts and rates are so estimated from the binned counts. Ungrouping binned data can be desirable for many reasons: Bins can be too coarse to allow for accurate analysis; comparisons can be hindered when different grouping approaches are used in different histograms; and the last interval is often wide and open-ended and, thus, covers a lot of information in the tail area. Age-at-death distributions grouped in age classes and abridged life tables are examples of binned data. Because of modest assumptions, the approach is suitable for many demographic and epidemiological applications. For a detailed description of the method and applications see @rizzi2015.

# Package Structure  
The package has two top level functions `pclm` and `pclm2D`, two auxiliary functions (`control.pclm` and `control.pclm2D`), several generic functions (`plot`, `summary`, `fitted`, `residuals`). A dataset (`ungroup.data`) is provided as well for testing purposes.

All functions are documented in the standard way, which means that once you load the package using `library(ungroup)` you can just type for example `?pclm` to see the help file.


```{r}
# Load the package
library(ungroup)
```

# Usage
## Univariate Penalized Composite Link Model (PCLM)

The PCLM method [@eilers2007] is based on the composite link model [@thompson1981], which extends standard generalized linear models. It implements the idea that the observed counts, interpreted as realizations from Poisson distributions, are indirect observations of a finer (ungrouped) but latent sequence. This latent sequence represents the distribution of expected means on a fine resolution and has to be estimated from the aggregated data. Estimates are obtained by maximizing a penalized likelihood. This maximization is performed efficiently by a version of the iteratively reweighted least-squares algorithm. Optimal values of the smoothing parameter are chosen by minimizing Bayesian or Akaike's Information Criterion [@hastie1990].

This is an example of estimation of the smooth age at death distributions from grouped death counts. First we have to define some grouped data:

```{r}
# Input data  
# x: Age groups
x <- c(0, 1, seq(5, 85, by = 5))
x
# y: Death counts in the age group
y <- c(294, 66, 32, 44, 170, 284, 287, 293, 361, 600, 998,  
       1572, 2529, 4637, 6161, 7369, 10481, 15293, 39016)
# offset: Population exposed to risk in the age group
offset <- c(114, 440, 509, 492, 628, 618, 576, 580, 634, 657, 
            631, 584, 573, 619, 530, 384, 303, 245, 249) * 1000
# nlast: the size of the last age interval (usually open)
nlast <- 26
# This results in the last group being [85, 110).
```

### Fitting and ungrouping data using PCLM
The model can be fitted using `pclm` function:
```{r, message=FALSE, results='hide'}
M1 <- pclm(x, y, nlast)
```

### Output
It generates different types of output stored in the created object. See `pclm` help page for detailed information about the output list (`?pclm`). 
```{r}
ls(M1)
```

### Summary
```{r}
summary(M1)
```

### plot.pclm
Generic plot:
```{r, fig.align='center', fig.asp=0.8, out.width = '60%'}
plot(M1)

# Print first 6 fitted values
fitted(M1)[1:6]
```

### out.step

By default `pclm` ungroups data in intervals of length `1`. If higher granularity is required `out.step` argument can be used to specify this. For example, obtaining groups 222 groups of length 0.5 one can try: 

```{r, message=FALSE, results='hide', fig.align='center', fig.asp=0.8, out.width = '60%'}
M2 <- pclm(x, y, nlast, out.step = 0.5)
plot(M2)
```

```{r}
# Print first 6 fitted values
fitted(M2)[1:6]
# Number of fitted values
length(fitted(M2))
```

### control.pclm

For controlling the PCLM fitting process `control.pclm` provides several options. The list of arguments needs to be specified using the `control` argument. For example, if we want to optimize the smoothing parameters in order to obtain a fit characterized by the small `AIC` level one can write:

```{r, eval=FALSE}
# Optimise smoothing parameter: lambda, kr and deg
M3 <- pclm(x, y, nlast, 
           control = list(lambda = NA, opt.method = "AIC"))
```

### Offset
The `offset` argument can be used to estimate smooth death rates. `offset` must be a vector of the same length as `y`. 

```{r, message=FALSE, results='hide'}
M5 <- pclm(x, y, nlast, offset)
```

Generic plot:
```{r, fig.align='center', fig.asp=0.8, out.width = '60%'}
plot(M5, type = "s")
```


## Two-dimensional Penalized Composite Link Model 

The PCLM can be extended to a two-dimensional regression problem. The two-dimensional Penalized Composite Link Model to ungroup simultaneously coarse distributions for adjacent years can be fitted using `pclm2D` function, and the structure of the functions works as `pclm`. See the examples provided in the help page. Note that `pclm2D` might be slower, depending on the data and model specification provided in the functions.

The two-dimensional regression analysis combines two approaches: the PCLM for ungrouping in one dimension and two-dimensional smoothing with P-splines [@currie2004]. As an example we can ungroup age-specific distributions from the coarsely grouped data and smooth across adjacent calendar years to estimate both detailed age-at-death distributions and mortality time trends.

# Acknowledgment
We thank Paul H.C. Eilers who provided insight and expertise that greatly supported the creation of this R package; and Catalina Torres and Tim Riffe for testing and offering feedback on the early versions of the software. 

The authors are also grateful to the following institutions for their support:

 * University of Southern Denmark;
 * Max-Planck Institute for Demographic Research;
 * SCOR Corporate Foundation for Science.

# References


