---
title: "Visualizing and compressing segregation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Visualizing and compressing segregation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

data.table::setDTthreads(1)

# skip this vignette on CRAN etc.
BUILD_VIGNETTE <- identical(Sys.getenv("BUILD_VIGNETTE"), "true")
knitr::opts_chunk$set(eval = BUILD_VIGNETTE)

library("segregation")
```

The package provides the functions `segcurve()` and `segplot()` to visualize segregation.
These functions return simple ggplots, which can then be further styled and themed.
For the `segplot()` function, it is often interesting to also compress the segregation
information that is contained in large datasets. How to do this using 
the functions `compress()` and `merge_units()` is also described below, and in more
detail [in this working paper](https://osf.io/preprints/socarxiv/ruw4g/).

## Segregation curve

The segregation curve was first introduced by [Duncan and Duncan (1955)](https://www.jstor.org/stable/2088328).
The function `segcurve()` provides a simple way of plotting one or several segregation curves:

```{r}
segcurve(subset(schools00, race %in% c("white", "asian")),
  "race", "school",
  weight = "n",
  segment = "state" # leave this out to produce a single curve
)
```

In this case, state `A` is the most segregated, while state `B` and `C` are similarly segregated,
but at a lower level. Segregation curves are closely related to the index of dissimilarity, and 
here this corresponds to the following index values:

```{r}
# converting to data.table makes this easier
data.table::as.data.table(schools00)[
  race %in% c("white", "asian"),
  dissimilarity(.SD, "race", "school", weight = "n"),
  by = .(state)
]
```


## Segplot

::: {.alert .alert-primary}
Please consider citing the following paper if you use segplot: Benjamin Elbers and Rob Gruijters. 2023. "[Segplot: A New Method for Visualizing Patterns of Multi-Group Segregation](https://doi.org/10.1016/j.rssm.2023.100860). Research in Social Stratification and Mobility.
:::

The function `segplot()` is provided to generate segplots. Segplots are described in more
detail [in this working paper](https://osf.io/preprints/socarxiv/ruw4g/).
The function requires the dataset, the group, and unit variables, and, if required, 
a variable that identifies the weight (`n` in this case). 

Other options to customize the look of the segplot are given by the argument `order`. 
By default, the units of the segplot are ordered by their local segregation score,
but it is also possible to order them by entropy (i.e., diversity) or by share 
of the majority population. This last option can be useful for the two-group case. 
The argument `bar_space` can be used to increase the space between the units 
from the default of zero space between bars. When plotting a subset of the dataset, 
the reference distribution shown on the right of the segplot can be changed by 
supplying a two-column data frame to the `reference_distribution` argument. 
One column of this frame should contain the group identifiers, and 
the other should include the reference proportion of each group. 

Examples of how to use these arguments are given below:

```{r}
sch <- subset(schools00, state == "A")

# basic segplot
segplot(sch, "race", "school", weight = "n")

# order by majority group (white in this case)
segplot(sch, "race", "school", weight = "n", order = "majority")

# increase the space between bars
# (has to be very low here because there are many schools in this dataset)
segplot(sch, "race", "school", weight = "n", bar_space = 0.0005)

# change the reference distribution
# (here, we just use an equalized distribution across the five groups)
(ref <- data.frame(race = unique(schools00$race), p = rep(0.2, 5)))
segplot(sch, "race", "school",
  weight = "n",
  reference_distribution = ref
)
```

It is also possible to show a secondary plot that shows the adjusted local segregation scores:

```{r}
segplot(sch, "race", "school", weight = "n", secondary_plot = "segregation")
```

## Compressing segregation information

The compression algorithm requires three steps to be taken. 
First, it is important to decide which units should be permitted to merge: 
for residential segregation, we may only want to allow neighboring units 
(such as tracts) to be mergeable. In this case, the first step consists
of compiling a data frame with exactly two columns, where each row identifies 
a pair of neighboring units. In other cases, we may want to allow all units 
to be mergeable, in principle. However, this can be very time-consuming as 
it requires each unit to be compared to all others at every step of the merging 
operation. To speed up compression, we therefore implement an option that 
allows units to be merged only within a window of "neighboring" units, 
where the definition of each window is based on similarities in local segregation. 
Hence, for a given unit, only `n_neighbors` are considered at every step, and
these neighbors are based on similarities in local segregation. 
Smaller `n_neighbors` values will result in faster run times, but increase 
the probability of non-optimal merges. The method of merging can be 
specified in the `compress()` function by supplying the argument neighbors.

The second step is then to run the actual compression algorithm using `compress()`. 
For this example, we choose to compress based on a relatively small window:

```{r, results='hide'}
# compression based on window of 20 'neighboring' units
# in terms of local segregation (alternatively, neighbors can be a data frame)
comp <- compress(sch, "race", "school",
  weight = "n", neighbors = "local", n_neighbors = 20
)
```

After running `compress()`—which can take some time depending on how 
many neighbors need to be considered—the output summarizes the compression 
that can be achieved:

```{r}
comp
```

The results indicate that 99% of the segregation information can be retained by 
only 98 units (out of 560 in the original dataset), 95% in only 24 units, 
and 90% in 10 units. The percentage of information retained on each iteration 
can be accessed via the data frame available through `comp$iterations`. 
This data frame can also be used to generate a plot that shows the relationship
between the number of merges and the loss in segregation information:

```{r}
scree_plot(comp)
```

Another way to learn more about the compression is to visualize the information as a dendrogram:

```{r}
dend <- as.dendrogram(comp)
plot(dend, leaflab = "none")
```

The third step is to create a new dataset based on the desired level of compression. 
This can be achieved using the function `merge_units()`, and either `n_units` or `percent`
can be specified to indicate the desired level of compression.

```{r}
sch_compressed <- merge_units(comp, n_units = 15)
# or, for instance: merge_units(comp, percent = 0.80)
head(sch_compressed)
```

The compressed dataset has the same format as the original dataset 
and can now be used to produce another segplot, e.g.

```{r}
segplot(sch_compressed, "race", "school", weight = "n", secondary_plot = "segregation")
```