---
title: "tidyCat: Tidy Methods For Categorical Data Analysis"
author: "Michael Friendly"
date: "`r Sys.Date()`"
package: vcdExtra
output: 
  rmarkdown::html_vignette:
  fig_caption: yes
bibliography: ["vcd.bib", "vcdExtra.bib"]
csl: apa.csl
vignette: >
  %\VignetteIndexEntry{tidyCat: Tidy Methods For Categorical Data Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.height = 6,
  fig.width = 7,
#  fig.path = "fig/tidycat-",
  dev = "png",
  comment = "##"
)
```


<center>
<img src="fig/tidyCat-logo.png" height=160>
</center>

> Frequency tables need some Tidy Love ❤️

Tidy methods for quantitative data & models have advanced considerably, but there hasn't been much
development of similar ideas for "categorical data", by which I mean data that is often compactly represented
as $n$-way frequency tables, cross classified by one or more discrete factors. 

What would it take to implement a tidy framework for such data? These notes are, in effect, a call for 
participation in developing a `tidyCat` package for this purpose. Other possible names for this:
`tidyCDA`, `tidyfreq` but `tidyCat` makes for a nice logo!

```{r load-pkgs}
library(MASS)
library(vcdExtra)
```

I see three areas that could be developed here:

## Constructing categorical data sets 
Current non-tidy data forms and operations, following [@FriendlyMeyer:2016:DDAR] are described in
the vignette [Creating and manipulating frequency tables](https://friendly.github.io/vcdExtra/articles/a1-creating.html)

It seems clear that the most flexible and general form, and one that most closely matches a tidy
data frame is **case form**, because this allows for numeric variables as well. Thus:

> Among these, tidy categorical data should best be represented as a `tibble` in **case form**. A tibble is convenient for its' printing.

### Manipulating categorical data sets

The methods `xtabs()`, `table()` and `expand.dft()` described in that vignette allow conversion from one form to another.
Tidy equivalents might be:

* `as_table()`, `as_matrix()`, `as_array()` to convert from any form to table/array form

* Similarly, perhaps `as_caseform()`, `as_freqform()` to convert to those. There is already `as.data.frame(table)` to convert to frequency form, and `expand.dft()` converts that to case form

```{r}
data("HairEyeColor")
hec.df <- as.data.frame(HairEyeColor)
head(hec.df)
# expand to case form
expand.dft(hec.df) |> head()
```


* `vcd::structable()` (and `stats::ftable()`) produce a 'flat' representation of a high-dimensional contingency table constructed by recursive splits (similar to the construction of mosaic displays). One can be constructed from a table or from a data frame with
a formula method,

```{r}
structable(Titanic)
structable(Sex + Class ~ Survived + Age, data = Titanic)
```

and there are a suite of methods for indexing and selecting parts of an $n$-way table.

* The methods in the `plyr` package (now retired) provided a coherent set of tools for a split-apply-combine
strategy that works nicely with multidimensional arrays. Perhaps there are some useful ideas for frequency tables
that could be resurrected here.

* There is also a role for `purrr` methods and thinking here: $n$-way tables as nested lists/arrays?
The ideas of mapping over these?

## Manipulating factor levels 

Also needed: 

* methods for **recoding and collapsing** the levels of a factor: `forcats::fct_recode()`, `forcats::fct_collapse()`,  `forcats::fct_lump_min()` are useful here.

* methods for **reordering the levels** of a factor, either manually or for some analysis purpose. For example, Data from @Glass:54 gave this 5 x 5 table on the occupations of 3500 British fathers and their sons, where the occupational categories are listed in alphabetic order.
  
```{r glass}
data(Glass, package="vcdExtra")
str(Glass)
(glass.tab <- xtabs(Freq ~ father + son, data=Glass))
```

This can be reordered manually by indexing, to arrange the categories by **status**, giving an order  `Professional` down to `Unskilled`:

```{r glass-order}
# reorder by status
ord <- c(2, 1, 4, 3, 5) 
glass.tab[ord, ord]
```

A more general method is to permute the row and column categories in the order implied by correspondence analysis dimensions. This is implemented in the [`seriation` package](https://cran.r-project.org/package=seriation) using the `CA` method of `seriation::seriate()` and applying `permute()` to the result.

```{r housetasks-seriation}
library(seriation)
order <- seriate(glass.tab, method = "CA")
# the permuted row and column labels
rownames(glass.tab)[order[[1]]]

# reorder rows and columns
permute(glass.tab, order)
```
What are tidy ways to do these things?


## Models

The standard analysis of frequency data is in the form of loglinear models fit by `MASS::loglm()`
or with more flexible versions fit with `glm()` or `gnm::gnm()`. These are essentially linear models for the log
of frequency. In `vcdExtra`, there are several methods for a list of such models, of class `"glmlist"` and `"loglmlist"`,
and these should be accommodated in tidy methods. 

What is needed are `broom` methods for `loglm` models. The information required is accessible from standard functions, but not in a tidy form.

* `glance.loglm()` -- **model level** statistics. These are given in the output of the `print()` method, and available from the `print()` method. A complication is both LR and Pearson $\chi^2$ are reported, so these would need to be made to appear in separate columns. There are also related `LRstats()` functions in `vcdExtra`, which report `AIC` and `BIC`.

```{r}
hec.indep <- loglm(~Hair+Eye+Sex, data=HairEyeColor)
hec.indep
# extract test statistics
summary(hec.indep)$tests
LRstats(hec.indep)
```

* `tidy.loglm()` --- **coefficient level** statistics. These are available from `coef.loglm()`. They would need to assembled into a long format. Standard errors & p-values might be a problem.

```{r}
coef(hec.indep)
```

* `augment.loglm()` --- should give **case level** statistics: fitted values, residuals, ...

```{r}
fitted(hec.indep)
residuals(hec.indep)
```

What about `hatvalues`? Not implemented, but shouldn't be too hard.

```{r error=TRUE}
hatvalues(hec.indep)
```


## Graphical methods

The most common graphical methods are those implemented in `vcd`: `mosaic()` association plots (`assoc()`), ..., which
rely on `vcd::strucplot()` described in [@vcd:Meyer+Zeileis+Hornik:2006b].

Is there a tidy analog that might work with `ggplot2`? The [`ggmosaic` package](https://github.com/haleyjeppson/ggmosaic) implements basic marimeko-style mosaic plots. They are not very general, in that they cannot do residual-based shading to show the patterns of association.

However, they are based on a [productplots](https://github.com/hadley/productplots) package by Hadley, which seems to
provide some basic structure for constructing such displays of nested rectangles.

## Some tidy stuff

* The [infer package](https://infer.netlify.app/) has some interesting stuff for CDA. 
The vignette [Tidy Chi-Squared Tests with infer](https://infer.netlify.app/articles/chi_squared)
expresses do a Chisq test of independences as:

```
observed_indep_statistic <- gss |>
  specify(college ~ finrela) |>
  hypothesize(null = "independence") |>
  calculate(stat = "Chisq")
```
You can do simulation tests like this:

```
null_dist_sim <- gss |>
  specify(college ~ finrela) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "Chisq")
```

## References
