---
title: "Introduction to tidylda"
author: "Tommy Jones"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to tidylda}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}---
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Note: for code examples, see README.md


# Introduction to the `tidylda` package

`tidylda` implements  Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/). `tidylda`'s Gibbs sampler is written in C++ for performance and offers several novel features. It also has methods for sampling from the posterior of a trained model, for more traditional Bayesian analyses.

_tidylda_'s Gibbs sampler has several unique features, described below.

**Non-uniform initialization:** Most LDA Gibbs samplers initialize by assigning words to topics and topics to documents (i.e., construct the $\boldsymbol{Cd}$ and $\boldsymbol{Cv}$ matrices) by sampling from a uniform distribution. This ensures initialization without incorporating any prior information. _tidylda_ incorporates the priors in its initialization. It begins by drawing $P(\text{topic}|\text{document})$ and $P(\text{word}|\text{topic})$ from Dirichlet distributions with parameters $\boldsymbol\alpha$ and $\boldsymbol\eta$, respectively. Then _tidylda_ uses the above probabilities to construct $P(\text{topic}|\text{word}, \text{document})$ and makes a single run of the Gibbs sampler to initialize $\boldsymbol{Cd}$ and $\boldsymbol{Cv}$. This non-uniform initialization powers transfer learning with LDA (tLDA), described elsewhere, by starting a Gibbs run near where the previous run left off. For initial models, it uses the user's prior information to tune where sampling starts.

**Flexible priors:** _tidylda_ has multiple options for setting LDA priors. Users may set scalar values for $\boldsymbol\alpha$ and $\boldsymbol\eta$ to construct symmetric priors. Users may also choose to construct vector priors for both $\boldsymbol\alpha$ and $\boldsymbol\eta$ for a full specification of LDA. Additionally, _tidylda_ allows users to set a matrix prior for $\boldsymbol\eta$, enabled by its implementation of tLDA. This lets users to set priors over word-topic relationships informed by expert input. The best practices for encoding expert input in this manner are not yet well studied. Nevertheless, this capability makes _tidylda_ unique among LDA implementations. 

**Burn in iterations and posterior averaging:** Most LDA Gibbs samplers construct posterior estimates of $\boldsymbol\Theta$ and $\boldsymbol{B}$ from $\boldsymbol{Cd}$ and $\boldsymbol{Cv}$'s values of the final iteration of sampling, effectively using a single sample. This is inconsistent with best practices from Bayesian statistics, which is to average over many samples from a stable posterior. _tidylda_ enables averaging across multiple samples of the posterior with the `burnin` argument. When `burnin` is set to a positive integer, _tidylda_ averages the posterior across all iterations larger than `burnin`. For example, if `iterations` is 200 and `burnin` is 150, _tidylda_ will return a posterior estimate that is an average of the last 50 sampling iterations. This ensures that posterior estimates are more likely to be representative than any single sample.

**Transfer learning with tLDA:** Finally, _tidylda_'s Gibbs sampler enables transfer learning with tLDA. The full specification of tLDA and details on its implementation in _tidylda_ are described briefly in the tLDA vignette and more thoroughly in forthcoming research.

### Tidy Methods

_tidylda_'s construction follows _Conventions of R Modeling Packages_ [@tidymodelsbook]. In particular, it contains methods for `print`, `summary`, `glance`, `tidy`, and `augment`, consistent with other "tidy" packages. These methods are briefly described below.

* `print`, `summary`, and `glance` return various summaries of the contents of a `tidylda` object, into which an LDA model trained with _tidylda_ is stored.
* `tidy` returns the contents of $\boldsymbol\Theta$, $\boldsymbol{B}$, or $\boldsymbol\Lambda$ (stored as `theta`, `beta`, and `lambda` respectively), as specified by the user, formatted as a tidy `tibble`, instead of a numeric matrix.
* `augment` appends model outputs to observational-level data. Taking the cue from [_tidytext_](https://github.com/juliasilge/tidytext), "observational-level" data is one row per word per document. Therefore, the key statistic used by `augment` is $P(\text{topic}|\text{word}, \text{document})$. _tidylda_ calculates this as $\boldsymbol\Lambda \times P(\text{word}|\text{document})$, where $P(\text{word}|\text{document}_d) = \frac{\boldsymbol{x}_d}{\sum_{v=1}^V x_{d,v}}$. 

### Other Notable Features

_tidylda_ has three evaluation metrics for topic models, two goodness-of-fit measures---$R^2$ as implemented from [_mvrsquared_](https://cran.r-project.org/package=mvrsquared) and the log likelihood of the model given the data---and one coherence measure---probabilistic coherence. A flag set during model fitting with `calc_r2 = TRUE`[^calcr2] will return a model with an $R^2$ statistic. Similarly, the log likelihood of the model, given the data, is calculated at each Gibbs iteration if the user selects `calc_likelihood = TRUE` during model fitting. 

The coherence measure is called probabilistic coherence. (See vignette on probabilistic coherence.) Probabilistic coherence is bound between -1 and 1. Values near zero indicate that the top words in a topic are statistically independent from each other. Positive values indicate that the top words in a topic are positively correlated in a statistically-dependent manner. Negative values indicate that the words in a topic are negatively correlated with each other in a statistically-dependent manner. (In practice, negative values tend to be very near zero.)

[^calcr2]: Users can calculate $R^2$ after a model is fit by using the _mvrsquared_ package or calling `tidylda:::calc_lda_rsquared`. `calc_lda_rsquared` is an internal function to _tidylda_, requiring the package name followed by three colons, as is R's standard.

_tidylda_ enables traditional Bayesian uncertainty quantification by sampling from the posterior. The posterior distribution for $\boldsymbol\theta_d$ is $\text{Dirichlet}(\boldsymbol{Cd}_d + \boldsymbol\alpha)$ and the posterior distribution for $\boldsymbol\beta_k$ is $\text{Dirichlet}(\boldsymbol{Cv}_k + \boldsymbol\eta)$ (or $\text{Dirichlet}(\boldsymbol{Cv}_k + \boldsymbol\eta_k)$ for tLDA). _tidylda_ enables a `posterior` method for `tidylda` objects, allowing users to sample from the posterior to quantify uncertainty for estimates of estimated parameters.

_tidylda_ uses one of two calculations for predicting topic distributions (i.e., $\hat{\boldsymbol\theta}_d$) for new documents. The first, and default, is to run the Gibbs sampler, constructing a new $\boldsymbol{Cd}$ for the new documents but without updating topic-word distributions in $\boldsymbol{B}$. The second uses a dot product as described in Appendix 1. _tidylda_ actually uses the dot product prediction combined with the _non-uniform initialization_---described above---to initialize $\boldsymbol{Cd}$ when predicting using the Gibbs sampler.

## Discussion

While many other topic modeling packages exist, _tidylda_ is very user friendly and brings novel features. Its user friendliness comes from compatibility with the tidyverse. And _tidylda_ includes tLDA and other methods contained in the previous chapters of this dissertation. It also has methods for sampling from the posterior of a trained model, for more traditional Bayesian analyses. _tidylda_'s Gibbs sampler is written in C++ for performance.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

```{r eval = FALSE}
install.packages("remotes")

remotes::install_github("tommyjones/tidylda")
```

## Example

```{r example}
library(tidytext)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidylda)
library(Matrix)

### Initial set up ---
# load some documents
docs <- nih_sample 

# tokenize using tidytext's unnest_tokens
tidy_docs <- docs |> 
  select(APPLICATION_ID, ABSTRACT_TEXT) |> 
  unnest_tokens(output = word, 
                input = ABSTRACT_TEXT,
                stopwords = stop_words$word,
                token = "ngrams",
                n_min = 1, n = 2) |> 
  count(APPLICATION_ID, word) |> 
  filter(n>1) #Filtering for words/bigrams per document, rather than per corpus

tidy_docs <- tidy_docs |> # filter words that are just numbers
  filter(! stringr::str_detect(tidy_docs$word, "^[0-9]+$"))

# append observation level data 
colnames(tidy_docs)[1:2] <- c("document", "term")


# turn a tidy tbl into a sparse dgCMatrix 
# note tidylda has support for several document term matrix formats
d <- tidy_docs |> 
  cast_sparse(document, term, n)

# let's split the documents into two groups to demonstrate predictions and updates
d1 <- d[1:50, ]

d2 <- d[51:nrow(d), ]

# make sure we have different vocabulary for each data set to simulate the "real world"
# where you get new tokens coming in over time
d1 <- d1[, colSums(d1) > 0]

d2 <- d2[, colSums(d2) > 0]

### fit an intial model and inspect it ----
set.seed(123)

lda <- tidylda(
  data = d1,
  k = 10,
  iterations = 200, 
  burnin = 175,
  alpha = 0.1, # also accepts vector inputs
  eta = 0.05, # also accepts vector or matrix inputs
  optimize_alpha = FALSE, # experimental
  calc_likelihood = TRUE,
  calc_r2 = TRUE, # see https://arxiv.org/abs/1911.11061
  return_data = FALSE
)

# did the model converge?
# there are actual test stats for this, but should look like "yes"
qplot(x = iteration, y = log_likelihood, data = lda$log_likelihood, geom = "line") + 
    ggtitle("Checking model convergence")

# look at the model overall
glance(lda)

print(lda)

# it comes with its own summary matrix that's printed out with print(), above
lda$summary


# inspect the individual matrices
tidy_theta <- tidy(lda, matrix = "theta")

tidy_theta

tidy_beta <- tidy(lda, matrix = "beta")

tidy_beta

tidy_lambda <- tidy(lda, matrix = "lambda")

tidy_lambda

# append observation-level data
augmented_docs <- augment(lda, data = tidy_docs)

augmented_docs

### predictions on held out data ---
# two methods: gibbs is cleaner and more technically correct in the bayesian sense
p_gibbs <- predict(lda, new_data = d2[1, ], iterations = 100, burnin = 75)

# dot is faster, less prone to error (e.g. underflow), noisier, and frequentist
p_dot <- predict(lda, new_data = d2[1, ], method = "dot")

# pull both together into a plot to compare
tibble(topic = 1:ncol(p_gibbs), gibbs = p_gibbs[1,], dot = p_dot[1, ]) |>
  pivot_longer(cols = gibbs:dot, names_to = "type") |>
  ggplot() + 
  geom_bar(mapping = aes(x = topic, y = value, group = type, fill = type), 
           stat = "identity", position="dodge") +
  scale_x_continuous(breaks = 1:10, labels = 1:10) + 
  ggtitle("Gibbs predictions vs. dot product predictions")

### Augment as an implicit prediction using the 'dot' method ----
# Aggregating over terms results in a distribution of topics over documents
# roughly equivalent to using the "dot" method of predictions.
augment_predict <- 
  augment(lda, tidy_docs, "prob") |>
  group_by(document) |> 
  select(-c(document, term)) |> 
  summarise_all(function(x) sum(x, na.rm = T))

# reformat for easy plotting
augment_predict <- 
  as_tibble(t(augment_predict[, -c(1,2)]), .name_repair = "minimal")

colnames(augment_predict) <- unique(tidy_docs$document)

augment_predict$topic <- 1:nrow(augment_predict) |> as.factor()

compare_mat <- 
  augment_predict |>
  select(
    topic,
    augment = matches(rownames(d2)[1])
  ) |>
  mutate(
    augment = augment / sum(augment), # normalize to sum to 1
    dot = p_dot[1, ]
  ) |>
  pivot_longer(cols = c(augment, dot))

ggplot(compare_mat) + 
  geom_bar(aes(y = value, x = topic, group = name, fill = name), 
           stat = "identity", position = "dodge") +
  labs(title = "Prediction using 'augment' vs 'predict(..., method = \"dot\")'")

# Not shown: aggregating over documents results in recovering the "tidy" lambda.

### updating the model ----
# now that you have new documents, maybe you want to fold them into the model?
lda2 <- refit(
  object = lda, 
  new_data = d, # save me the trouble of manually-combining these by just using d
  iterations = 200, 
  burnin = 175,
  calc_likelihood = TRUE,
  calc_r2 = TRUE
)

# we can do similar analyses
# did the model converge?
qplot(x = iteration, y = log_likelihood, data = lda2$log_likelihood, geom = "line") +
  ggtitle("Checking model convergence")

# look at the model overall
glance(lda2)

print(lda2)


# how does that compare to the old model?
print(lda)
```