---
title: Exploring the close relationship between topic modeling and non-negative matrix factorization for count data
author: Peter Carbonetto
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Relationship between NMF and topic modeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
                      fig.align = "center")
```

The 'fastTopics' package implements fast optimization algorithms for
fitting topic models and non-negative matrix factorizations to count
data. This document explains the close relationship between topic
modeling and non-negative matrix factorization (NMF).

These are the packages used in the examples below:

```{r load-pkgs, message=FALSE, warning=FALSE}
library(fastTopics)
library(ggplot2)
library(cowplot)
```

Poisson non-negative matrix factorization
-----------------------------------------

In Poisson non-negative matrix factorization
([Lee & Seung, 2001][lee-seung-2001]), counts $x_{ij}$ in an $n
\times m$ matrix, $X$, are modeled by the Poisson
distribution:
$$
x_{ij} \sim \mathrm{Poisson}(\lambda_{ij}).
$$
Each Poisson rate, $\lambda_{ij}$, is a simple linear combination of
parameters $f_{jk} \geq 0, l_{ik} \geq 0$ to be fitted to the data:
$$
\lambda_{ij} = \sum_{k=1}^K l_{ik} f_{jk},
$$
in which $K$ is a user-specified tuning parameter specifying the rank
of the matrix factorization.

The 'fastTopics' function `fit_poisson_nmf` computes maximum-likelihood
estimates (MLEs) of the Poisson NMF parameters. These parameters are
stored as an $m \times K$ matrix $F$, and an $n \times K$ matrix
$L$. Drawing the connection to PCA, we call $L$ the "loadings matrix"
(also sometimes called the "activations"), and $F$ is the "factors
matrix" (or "basis vectors"). Here we siulate a "toy" data set with
$n = 80$, $m = 100$ and $K = 3$:

```{r fit-poisson-nmf-model, message=FALSE}
set.seed(1)
dat  <- simulate_count_data(80,100,3)
fit1 <- fit_poisson_nmf(dat$X,k = 3,numiter = 200,verbose = "none")
rbind(dim(dat$X),
      dim(fit1$L),
      dim(fit1$F))
```

The fitting procedure finds an $L$ and an $F$ that maximize
the likelihood (or, equivalently, that minimize the "deviance"):

```{r plot-loglik, fig.height=3, fig.width=3.5}
quickplot(x = 1:200,y = fit1$progress$loglik,geom = c("point","line"),
  xlab = "iteration",ylab = "loglik") + theme_cowplot(12)
```

Note that the MLE is not unique; if we multiply the $k$th column of
$F$ by some positive scalar, $a_k$, and divide the corresponding
column of $L$ by the same $a_k$, the likelihood will not change.

The multinomial topic model
---------------------------

In the multinomial topic model ([Hofmann, 1999][hofmann-1999];
[Blei *et al*, 2003][blei-2003]), each sample, or row of $X$, is drawn
from a multinomial distribution:
$$
x_{i1}, \ldots, x_{im} \,|\, t_i \sim \mathrm{Multinom}(t_i, \pi_i),
$$
Here. $t_i = \sum_{j=1}^m x_{ij}$ is the total sample size for the
$i$th row, and the multinomial probabilities are
$$
\pi_{ij} = \sum_{k=1}^K l_{ik}^{\ast} f_{jk}^{\ast}
$$

where $l_{ik}^{\ast} \geq 0, f_{jk}^{\ast} \geq 0$. Note the
similarity to the Poisson rates $\lambda_{ij}$ above. There is,
however, an important difference: the topic model parameters are
additionally constrained so that each row of the loadings matrix,
$L^{\ast}$, and each column of factors matrix, $F^{\ast}$, sums to 1:
$$
\sum_{k=1}^K l_{ik}^{\ast} = 1, \qquad
\sum_{j=1}^m f_{jk}^{\ast} = 1.
$$

With these new constraints, the $\pi_{ij}$'s are guaranteed to be
probabilities, and the factors and loadings have a simple
interpretation: $l_{ik}^{\ast}$ is the proportion that topic $k$
contributes to the counts in sample $i$; $f_{jk}$ is the probability
of item $j$ occuring in topic $k$.

From Poisson NMF to topic model
-------------------------------

Having fitted the Poisson NMF model, the multinomial topic model
is easily recovered:

```{r poisson2multinom-1}
fit.multinom <- poisson2multinom(fit1)
```

The rows of $L$ and columns of $F$ indeed sum to 1:

```{r poisson2multinom-2}
range(rowSums(fit.multinom$L))
range(colSums(fit.multinom$F))
```

The multinomial topic model is in fact a *reparameterization* of the
Poisson NMF model; in particular, the *likelihood does not change with
this reparameterization.* We have the following result:
$$
\mathrm{Multinom}(x_i; t_i, \pi_i) \times \mathrm{Poisson}(t_i;
s_i) = \prod_{j=1}^m \mathrm{Poisson}(x_{ij}; \lambda_{ij}),
$$
in which the $s_i$'s are the "size factors". We can check that this
result holds in the toy data:

```{r loglik-poisson-vs-multinom, fig.height=3, fig.width=3.25}
ll.pnmf     <- loglik_poisson_nmf(dat$X,fit1)
ll.multinom <- loglik_multinom_topic_model(dat$X,fit.multinom) +
               dpois(rowSums(dat$X),fit.multinom$s,log = TRUE)
quickplot(x = ll.pnmf,y = ll.multinom) + theme_cowplot(12)
```

From topic model to Poisson NMF
-------------------------------

We can also easily go in the reverse direction, from multinomial topic
model fit to Poisson NMF fit. Here is an example, using the same toy
data set (the size factor estimates returned by `poisson2multinom`
were removed first to demonstrate that they are not needed):

```{r multinom2poisson-1, fig.height=3, fig.width=3.25}
fit.multinom["s"] <- NULL
fit2    <- multinom2poisson(fit.multinom,dat$X)
loglik1 <- loglik_poisson_nmf(dat$X,fit1)
loglik2 <- loglik_poisson_nmf(dat$X,fit2)
quickplot(x = loglik1,y = loglik2) + theme_cowplot(12)
```

From this plot, we see that the likelihoods for the original Poisson
NMF fit, and the one recovered from the topic model, are exactly the
same. (Note that the factors and loadings will not necessarily be the
same---they will only the same up to a rescaling of the columns).

The reverse transformation from multinomial topic model back to
Poisson NMF is in fact *very* simple, involving only a simple
rescaling of the loadings by the total counts for each sample (the
function `multinom2poisson` is really only needed for additional
bookkeeping):

```{r multinom2poisson-2, fig.height=3, fig.width=3.25}
s           <- rowSums(dat$X)
fit3        <- fit.multinom
fit3$L      <- s * fit3$L
class(fit3) <- c("poisson_nmf_fit","list")
loglik3     <- loglik_poisson_nmf(dat$X,fit3)
quickplot(x = loglik1,y = loglik3) + theme_cowplot(12)
```

Summary
-------

The 'fastTopics' approach aims to combine the advantages of Poisson
NMF and topic modeling.

The Poisson NMF is easier to fit because it doesn't have the
additional constraints.

On ther other hand, the parameters of the topic model are typically
more useful for for understanding data; for example, the
$l_{ik}^{\ast}$'s are topic proportions, which can be used to compare
samples (documents), among other things.

So we provide a function to fit the Poisson NMF model
(`fit_poisson_nmf`), and another function to obtain the topic model
fit (`poisson2multinom`).

[hofmann-1999]: https://doi.org/10.1145/312624.312649

[lee-seung-2001]: http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization

[blei-2003]: https://jmlr.csail.mit.edu/papers/v3/blei03a.html