---
title: "Error estimation"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    df_print: kable
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{error estimation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(echo = TRUE)
```

For the most part, this document will present the functionalities of the function 
`surveysd::calc.stError()` which generates point estimates and standard errors for
user-supplied estimation functions.

## Prerequisites

In order to use a dataset with `calc.stError()`, several weight columns have to be present. Each
weight column corresponds to a bootstrap sample. In the following examples, we will use the data 
from `demo.eusilc()` and attach the bootstrap weights using `draw.bootstrap()` and `recalib()`. Please 
refer to the documentation of those functions for more detail.

```{r}
library(surveysd)

set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region",
                          epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE)
dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]

## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
```


## Estimator functions

The parameters `fun` and `var` in `calc.stError()` define the estimator to be used in the error 
analysis. There are two built-in estimator functions `weightedSum()` and `weightedRatio()` which can
be used as follows.


```{r}
povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)
```

Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income.
By default, the results are calculated separately for each reference period.

```{r}
povertyRate$Estimates
totalIncome$Estimates
```

Columns that use the `val_` prefix denote the point estimate belonging to the "main weight" of the
dataset, which is `pWeight` in case of the dataset used here.

Columns with the `stE_` prefix denote standard errors calculated with bootstrap replicates. The
replicates result in using `w1`, `w2`, ..., `w10` instead of `pWeight` when applying the estimator.

`n` denotes the number of observations for the year and `N` denotes the total weight of those
persons.

### Custom estimators

In order to define a custom estimator function to be used in `fun`, the function needs to have
at least two arguments like the example below.

```{r}
## define custom estimator
myWeightedSum <- function(x, w) {
  sum(x*w)
}

## check if results are equal to the one using `surveysd::weightedSum()`
totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum)
all.equal(totalIncome$Estimates, totalIncome2$Estimates)
```

The parameters `x` and `w` can be assumed to be vectors with equal length with `w` being numeric
weight vector and `x` being the column defined in the `var` argument. It will be called once
for each period (in this case `year`) and for each weight column (in this case `pWeight`, `w1`, `w2`, ..., `w10`).

Custom estimators using additional parameters can also be supplied and parameter `add.arg` can be used
to set the additional arguments for the custom estimator.

```{r}
## use add.arg-argument
fun <- function(x, w, b) {
  sum(x*w*b)
}
add.arg = list(b="onePerson")

err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun,
                        period.mean = 0, add.arg=add.arg)
err.est$Estimates

# compare with direct computation
compare.value <- dat_boot_calib[,fun(povertyRisk,pWeight,b=onePerson),
                                 by=c("year")]
all((compare.value$V1-err.est$Estimates$val_povertyRisk)==0)
```

The above chunk computes the weighted poverty ratio for single person households.

### Adjust variable depending on bootstrap weights

In our example the variable `povertyRisk` is a boolean and is `TRUE` if the income is less
than 60% of the weighted median income. Thus it directly depends on the original weight vector
`pWeight`. To further reduce the estimated error one should calculate for each bootstrap replicate
weight $w$ the weighted median income $medIncome_{w}$ and then define $povertyRisk_w$ as

$$
povertyRisk_w = \cases{1 \quad\text{if Income}<0.6\cdot medIncome_{w}\\
                     0 \quad\text{else}}
$$

The estimator can then be applied to the new variable $povertyRisk_w$.
This can be realized using a custom estimator function.

```{r}
# custom estimator to first derive poverty threshold 
# and then estimate a weighted ratio
povmd <- function(x, w) {
 md <- laeken::weightedMedian(x, w)*0.6
 pmd60 <- x < md
 # weighted ratio is directly estimated inside the function
 return(sum(w[pmd60])/sum(w)*100)
}

err.est <- calc.stError(
  dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
  fun.adjust.var = povmd, adjust.var = "eqIncome")
err.est$Estimates

```


The approach shown above is only valid if no grouping variables are supplied (parameter `group = NULL`).
If grouping variables are supplied one should use parameters `fun.adjust.var` and `adjust.var` such that
the $povertyRisk_w$ is first calculated for each `period` and then used for each grouping in `group`.


```{r}
# using fun.adjust.var and adjust.var to estimate povmd60 indicator
# for each period and bootstrap weight before applying the weightedRatio
povmd2 <- function(x, w) {
 md <- laeken::weightedMedian(x, w)*0.6
 pmd60 <- x < md
 return(as.integer(pmd60))
}

# set adjust.var="eqIncome" so the income vector is used to estimate
# the povmd60 indicator for each bootstrap weight
# and the resulting indicators are passed to function weightedRatio
group <- "gender"
err.est <- calc.stError(
  dat_boot_calib, var = "povertyRisk", fun = weightedRatio, group = "gender",
  fun.adjust.var = povmd2, adjust.var = "eqIncome")
err.est$Estimates
```


### Multiple estimators

In case an estimator should be applied to several columns of the dataset, `var` can be set to
a vector containing all necessary columns.

```{r}
multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
multipleRates$Estimates
```
Here we see the relative number of persons at risk of poverty and the relative number of one-person
households.

## Grouping

The `groups` argument can be used to calculate estimators for different subsets of the data. This
argument can take the grouping variable as a string that refers to a column name (usually a factor) 
in `dat`. If set, all estimators are not only split by the reference period but also by the
grouping variable. For simplicity, only one reference period of the above data is used.

```{r}
dat2 <- subset(dat_boot_calib, year == 2010)
for (att  in c("period", "weights", "b.rep"))
  attr(dat2, att) <- attr(dat_boot_calib, att)
```

To calculate the ratio of persons at risk of poverty for each federal state of Austria, 
`group = "region"` can be used.

```{r}
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates
```

The last row with `region = NA` denotes the aggregate over all regions. Note that the
columns `N` and `n` now show the weighted and unweighted number of persons in each region.

### Several grouping variables

In case more than one grouping variable is used, there are several options of calling 
`calc.stError()` depending on whether combinations of grouping levels should be regarded or not.
We will consider the variables `gender` and `region` as our grouping variables and show three 
options on how `calc.stError()` can be called.

#### Option 1: All regions and all genders

Calculate the point estimate and standard error for each region and each gender. The number of 
rows in the output is therefore 

$$n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.$$

The last row is again the estimate for the whole period.

```{r}
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = c("gender", "region"))
povertyRates$Estimates

```

#### Option 2: All combinations of `region` and `gender`

Split the data by all combinations of the two grouping variables. This will result in a larger 
output-table of the size

$$n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.$$

```{r}
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list(c("gender", "region")))
povertyRates$Estimates
```

#### Option 3: Cobination of Option 1 and Option 2

In this case, the estimates and standard errors are calculated for

* every gender,
* every region and
* every combination of region and gender.

The number of rows in the output is therefore 

$$n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.$$

```{r}
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list("gender", "region", c("gender", "region")))
povertyRates$Estimates
```


## Group differences

If differences between groups need to be calculated, e.g difference of poverty rates between `gender = "male"` and `gender = "female"`, parameter `group.diff` can be utilised.
Setting `group.diff = TRUE` the differences and the standard error of these differences for all variables defined in `groups` will be calculated.

```{r}
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = c("gender", "region"),
                             group.diff = TRUE)
povertyRates$Estimates
```

The resulting output table contains `r nrow(povertyRates$Estimates)` rows. `r sum(povertyRates$Estimates$estimate_type == "direct")` rows for all the direct estimators

$$n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12,$$

and another `r sum(povertyRates$Estimates$estimate_type == "group difference")` for all the differences within the variable `"gender"` and `"region"` seperately.
Variable `"gender"` has 2 unique values (`unique(dat2$gender)`) resulting in 1 difference, ~ `gender = "male"` - `gender = "female"` and variable `"region"` has
9 unique values (`unique(dat2$region)`) resulting in 

$$8 + 7 + 6 + 5 + 4 + 3 + 2 + 1  = \sum\limits_{1=1}^{9-1}i = 36$$

estimates. Thus the output contains 1 + 36 = 37 estimates with respect to group differences.

If a combintaion of grouping variables is used in `group` and `group.diff = TRUE` then differences between combinations will only be calculated
if one of the grouping variables differs.
For example the difference between the following groups would be calculated

- `gender = "female" & region = "Vienna"` - `gender = "male" & region = "Vienna"` 
- `gender = "female" & region = "Vienna"` - `gender = "female" & region = "Salzburg"`
- `gender = "male" & region = "Salzburg"` - `gender = "female" & region = "Salzburg"`

The difference between `gender = "female" & region = "Vienna"` and `gender = "male" & region = "Salzburg"` however would not be calculated.

Thus this leads to 

$$2\cdot(\sum\limits_{1=1}^{9-1}i) + 9\cdot1 = 81$$

results with respect to the differences.
The Output contains an additional column `estimate_type` and 

```{r}
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list(c("gender", "region")),
                             group.diff = TRUE)
povertyRates$Estimates[,.N,by=.(estimate_type)]
```

## Differences between survey periods

Differences of estimates between `period`s can be calculated using parameter `period.diff`.
`period.diff` expects a character vector (if not `NULL`) specifying for which `period`s the differences should be calcualed for.
The inputs should be specified in the form `"period2" - "period1"`.

```{r}
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, 
                             period.diff = c("2017 - 2016", "2016 - 2015", "2015 - 2014"))
povertyRates$Estimates
```

If additional grouping variables are supplied to `calc.stError()` die differences across `period`s are also carried out
for all variables in `group`.

```{r}
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, 
                             group = "gender",
                             period.diff = c("2017 - 2016", "2016 - 2015", "2015 - 2014"))
povertyRates$Estimates
```


## Averages across periods

With parameter `period.mean` averages across `period`s are calculated additional. The parameter accepts only odd integer values.
The resulting table will contain the direct estimates as well as rolling averages of length `period.mean`.

```{r}
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, 
                             period.mean = 3)
povertyRates$Estimates
```

if in addition the parameters `group` and/or `period.diff` are specified then differences and groupings of averages will be calculated.

```{r}
povertyRates <- calc.stError(dat_boot_calib[year>2013], var = "povertyRisk", fun = weightedRatio, 
                             period.mean = 3, period.diff = "2016 - 2015",
                             group = "gender")
povertyRates$Estimates
```