---
title: "Small count frequency table suppression"
author: ""
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Small count frequency table suppression}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


```{r include = FALSE}
htmltables = TRUE
if(htmltables){
  source("GaussKable.R")
  P = function(...) G(fun = SuppressSmallCounts, timevar = "geo", ...)
} else { 
  P = function(...) cat("Formatted table not avalable")
}
```


## Introduction 
The `GaussSuppression` package contains several easy-to-use wrapper functions and in this vignette we will look at 
the `SuppressSmallCounts` function.
In this function, small frequencies are primary suppressed.
Then, as always in this package, secondary suppression is performed using the Gauss method.

We begin by creating datasets to be used below.
The first examples are based on `dataset_a`, which has six rows.

```{r}
library(GaussSuppression)
dataset <- SSBtoolsData("example1")
dataset_a <- dataset[dataset$year == "2014", -4]
dataset_b <- dataset[dataset$year == "2015", -4]
dataset_a
```

## An initial basic example

In the function description (`?SuppressSmallCounts`), the only visible parameter is `maxN` in addition to the parameters considered in the define-tables vignette.
In the first example, we use `maxN = 1` which means that zeros and ones are primary suppressed.

```{r}
SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo"), 
                    freqVar = "freq", 
                    maxN = 1)
```

A formatted version of this output is given in Table 1 below. Primary suppressed cells are underlined and labeled in red, while the secondary suppressed cells are labeled in purple.

\
 <p style="text-align: center;"> <font size = 3> **Table 1**: `dimVar = c("age", "geo"), maxN = 1` </font> </p>
```{r echo=FALSE}
P(data = dataset_a, # caption = "Table 1", 
                    dimVar = c("age", "geo"), 
                    freqVar = "freq", 
                    maxN = 1)
```
\

The same output is obtained if microdata is sent as input as illustrated by de code below.  

```{r}
microdata_a <- SSBtools::MakeMicro(dataset_a, "freq")[-4]
output <- SuppressSmallCounts(data = microdata_a, 
                              dimVar = c("age", "geo"), 
                              maxN = 1)
```

A related point is that the third row of the table can be omitted (`data = dataset_a[-3, ]`) since the frequency is zero. 
When the frequency is zero, there is no underlying microdata. 
Later in this vignette, we address scenarios where the inclusion of zeros may be important.

##  An hierarchical table

A more advanced example is obtained by including the variable "eu". 

```{r}
SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo", "eu"), 
                    freqVar = "freq", 
                    maxN = 2)
```

A formatted version of this output: 

\
 <p style="text-align: center;"> <font size = 3> **Table 2**: `dimVar = c("age", "geo", "eu"), maxN = 2` </font> </p>
```{r echo=FALSE}
P(data = dataset_a, # caption = "Table 2",  
                    dimVar = c("age", "geo", "eu"), 
                    freqVar = "freq", 
                    maxN = 2)
```
\

As described in the define-tables vignette hierarchies are here detected automatically.
The same output is obtained if we first generate hierarchies by: 

```{r}
dimlists <- SSBtools::FindDimLists(dataset_a[c("age", "geo", "eu")])
dimlists
```

And thereafter run SuppressSmallCounts with these hierarchies as input:

```{r eval=FALSE}
SuppressSmallCounts(data = dataset_a[c("age", "geo", "freq")], 
                    hierarchies = dimlists, 
                    freqVar = "freq", 
                    maxN = 2)
```


## Using the formula interface

Using the formula interface is one way to achieve fewer cells in the output: 


```{r}
SuppressSmallCounts(data = dataset_a, 
                    formula = ~age:eu + geo, 
                    freqVar = "freq", 
                    maxN = 2)
```

In the formatted version of this output, blank cells indicate that they are not included in the output. 

\
 <p style="text-align: center;"> <font size = 3> **Table 3**: `formula = ~age:eu + geo, maxN = 2` </font> </p>
```{r echo=FALSE}
P(data = dataset_a, # caption = "Table 3", 
                    formula = ~age:eu + geo, 
                    freqVar = "freq", 
                    maxN = 2)
```
\

## About suppression of zeros

By default, zeros are suppressed in order to protect against attribute disclosure in frequency tables. 
However, there are exceptions.
Below are several options for handling exceptions.

### Zeros not suppressed

One option is to use `protectZeros = FALSE`. 

```{r}
SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 4,
                    protectZeros = FALSE)
```


\
 <p style="text-align: center;"> <font size = 3> **Table 4**: `dimVar = c("age", "geo", "eu"), maxN = 4, protectZeros = FALSE` </font> </p>
```{r echo=FALSE}
P(data = dataset_a, # caption = "Table 4", 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 4,
                    protectZeros = FALSE)
```
\

Another possibility that gives the same output is:

```{r}
output <- SuppressSmallCounts(data = dataset_a[-3, ], 
                              dimVar = c("age", "geo", "eu"),  
                              freqVar = "freq", 
                              maxN = 4,
                              extend0 = FALSE, 
                              structuralEmpty = TRUE)
```

Here the zero-frequency row is omitted in the input. 
By default, the table is automatically extended so that the Gauss algorithm handles zeros correctly. 
When this is turned off (`extend0 = FALSE`), a warning with the following text will appear: 
"*Suppressed cells with empty input will not be protected. Extend input data with zeros?*".
However, with `structuralEmpty = TRUE`, the "empty zeros" are assumed to represent structural zeros 
that must not be suppressed. 
As exemplified a little further below, one can thus handle data with both structural and non-structural zeros.


### Secondary suppressed zeros 

We can combine `protectZeros = FALSE` with `secondaryZeros = TRUE`. 

```{r}
SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 3,
                    protectZeros = FALSE, 
                    secondaryZeros = TRUE)
```


\
 <p style="text-align: center;"> <font size = 3> **Table 5**: `dimVar = c("age", "geo", "eu"), maxN = 3,` <br> `protectZeros = FALSE, secondaryZeros = TRUE` </font> </p>
```{r echo=FALSE}
P(data = dataset_a, # caption = "Table 5",  
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 3,
                    protectZeros = FALSE, 
                    secondaryZeros = TRUE)
```
\

### Both structural and non-structural zeros

The example below uses `dataset_b`, which has two zeros.

```{r}
  dataset_b
```

Let's assume that the first zero is considered as a structural zero. 
In order to account for this characteristic, we will exclude this particular zero and retain the other. 
As a general rule, we will exclude all structural zeros.


```{r}
SuppressSmallCounts(data = dataset_b[-2,  ], 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2,
                    extend0 = FALSE, 
                    structuralEmpty = TRUE)
```


\
 <p style="text-align: center;"> <font size = 3> **Table 6**: `dimVar = c("age", "geo", "eu"), maxN = 2,` <br> `extend0 = FALSE, structuralEmpty = TRUE` </font> </p>
```{r echo=FALSE}
P(data = dataset_b[-2,  ], # caption = "Table 6", 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2,
                    extend0 = FALSE, 
                    structuralEmpty = TRUE)
```
\

Now, the data has been processed correctly, the structural zeros will be published while the other zeros are suppressed.

To get the same output with the formula interface, we can use the following code:

```{r eval = FALSE}
SuppressSmallCounts(data = dataset_b[-2,  ], 
                    formula = ~age * (geo + eu),  
                    freqVar = "freq", 
                    maxN = 2,
                    extend0 = FALSE, 
                    structuralEmpty = TRUE,
                    removeEmpty = FALSE)
```

Please note that in order to include empty cells in the output, you need to set the `removeEmpty` parameter to `FALSE`. 
By default, this parameter is set to `TRUE` when using the formula interface.


## The problem of singletons and zeros

### The problem of zeros

When using the standard suppression technique on table `dataset_b`, many cells are suppressed. 

```{r}
SuppressSmallCounts(data = dataset_b, 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2)
```


\
 <p style="text-align: center;"> <font size = 3> **Table 7**: `dimVar = c("age", "geo", "eu"), maxN = 2` </font> </p>
```{r echo=FALSE}
P(data = dataset_b, # caption = "Table 7", 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2)
```
\


The reason for the Spain suppressions is to prevent the disclosure of zeros, 
which would be easily revealed if *young:Spain* is not suppressed. 
In that case the sum of *young:Iceland* and *young:Portugal* can easily be calculated to be zero. 
Since negative frequencies are not possible, the only possibility is two zeros. 

The handling of this problem is standard, but it can be turned off by `singletonMethod = "none"`.


### The problem of singletons

This problem occurs when `protectZeros = FALSE` and `secondaryZeros = FALSE` (default).
We now also look at a larger example that uses `dataset` which has 18 rows.   

```{r}
output <- SuppressSmallCounts(data = dataset, 
                              formula = ~age*geo*year + eu*year,  
                              freqVar = "freq", 
                              maxN = 1, 
                              protectZeros = FALSE)
head(output)
```


\
 <p style="text-align: center;"> <font size = 3> **Table 8**: `formula = ~age*geo*year + eu*year, maxN = 1, protectZeros = FALSE` </font> </p>
```{r echo=FALSE}
P(data = dataset, # caption = "Table 8",
                    formula = ~age*geo*year + eu*year,
                    freqVar = "freq", 
                    maxN = 1,
                    protectZeros = FALSE)
```
\

In this output, *young:2016:Spain* is suppressed due to the standard handling of the singleton problem.

However, by using `singletonMethod = "none"` in this case, *young:2016:Spain* will not be suppressed.
Then the sum of *young:2016:Iceland* and *young:2016:Portugal* can easily be calculated to be two.
Since zeros are never suppressed, the only possible values for these two cells are two ones.


## Intervals

Intervals for the primarily suppressed cells are computed whenever the `lpPackage` parameter is specified. 
Additionally, if `rangePercent` and/or `rangeMin` are provided, further suppression is performed to ensure 
that the interval width requirements are met. 
See the documentation for the `lpPackage` parameter in `?GaussSuppressionFromData` for more details.


In the example below, the interval widths are 4 after applying the standard Gaussian suppression algorithm. 
Since a minimum width of 5 is required, the two cells for Spain are additionally suppressed.


```{r, echo=FALSE, message=FALSE, warning=FALSE}
if (!requireNamespace("Rglpk", quietly = TRUE)) {
  cat("Note: The final part of this vignette requires the suggested package 'Rglpk', which is not installed. That part has been skipped.\n")
  knitr::knit_exit()
}
```

```{r}
output <- SuppressSmallCounts(data = dataset, 
                              formula = ~age*geo,  
                              freqVar = "freq", 
                              maxN = 3, 
                              rangeMin = 5,
                              lpPackage = "Rglpk")
tail(output) 
```
\

In the formatted output shown in Table 9 below, the final intervals are given in parentheses.

\
 <p style="text-align: center;"> <font size = 3> **Table 9**: Output from `SuppressSmallCounts`  with `formula = ~age*geo`, `maxN = 3`, `rangeMin = 5`, `lpPackage = "Rglpk"` </font> </p>
```{r echo=FALSE}
P(data=dataset, 
  formula = ~age*geo,  
  freqVar = "freq", 
  maxN = 3, 
  rangeMin = 5,
  lpPackage = "Rglpk",
  print_expr = 'ifelse(is.na(lo), freq, paste0(freq, " [", lo, ", ", up, "]"))')
```
\


