--- title: "5. Document summarization" author: "Thomas W. Jones" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{5. Document summarization} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # handle package dependencies in Suggests gracefully need <- function(pkgs) all(vapply(pkgs, requireNamespace, TRUE, quietly = TRUE)) ``` In this example we'll use text embeddings and a bit of network analysis to build a basic document summarizer. Many document summarizers, as the one we'll build here, do not generate language. Instead, they break a document down into sentences and then use some mechanism to score each sentence for relevance. Sentences with the top scores are returned as the "summary." For more information on summarization, a good place to start is [here](https://en.wikipedia.org/wiki/Automatic_summarization). The summarizer we'll build is a version of the [TextRank algorithm](https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approach:_TextRank). We will split a document into sentences, create a nearest-neighbor network where sentences are connected to other similar sentences, and rank the sentences according to [eigenvector centrality](https://en.wikipedia.org/wiki/Eigenvector_centrality). We will use a word embedding model, created on a whole corpus, to project the sentences into the embedding space. Once in the embedding space, we will measure similarity between documents using [Hellinger distance](https://en.wikipedia.org/wiki/Hellinger_distance). Hellinger distance is a metric specifically for probability distributions. Since we'll use LDA to create embeddings to a probability space, it's a useful measure. # Getting started We'll use the movie review data set from `text2vec` again. The first thing we need to do is create a TCM and embedding model. We will skip evaluation such as R-squared, coherence, inspecting top terms, etc. However, in any real application, I'd strongly suggest evaluating your models at every step of the way. ```{r embedding} library(textmineR) # load the data data(movie_review, package = "text2vec") # let's take a sample so the demo will run quickly # note: textmineR is generally quite scaleable, depending on your system set.seed(123) s <- sample(1:nrow(movie_review), 200) movie_review <- movie_review[ s , ] # let's get those nasty "
" symbols out of the way movie_review$review <- stringr::str_replace_all(movie_review$review, "
", "") # First create a TCM using skip grams, we'll use a 5-word window # most options available on CreateDtm are also available for CreateTcm tcm <- CreateTcm(doc_vec = movie_review$review, skipgram_window = 10, verbose = FALSE, cpus = 1) # use LDA to get embeddings into probability space # This will take considerably longer as the TCM matrix has many more rows # than a DTM embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 200, burnin = 180, alpha = 0.1, beta = 0.05, optimize_alpha = TRUE, calc_likelihood = FALSE, calc_coherence = FALSE, calc_r2 = FALSE, cpus = 1) ``` # Building a basic document summarizer Let's use the above embeddings model to create a document summarizer. This will return the three most relevant sentences in each review. The summarizer works best as a function, as we have many documents to summarize. The function `summarizer` is defined in the next section. However, let's look at some key bits of code in detail. The variable `doc` represents a single document, or a single element of a character vector. In the code chunk below, we split the document into sentences using the `stringi` package. Then we embed each sentence under the model built on our whole corpus, above. ```{r eval = FALSE} # parse it into sentences sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]] names(sent) <- seq_along(sent) # so we know index and order # embed the sentences in the model e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 1) # remove any documents with 2 or fewer words e <- e[ rowSums(e) > 2 , ] vocab <- intersect(colnames(e), colnames(gamma)) e <- e / rowSums(e) e <- e[ , vocab ] %*% t(gamma[ , vocab ]) e <- as.matrix(e) ``` Next, we measure the distance between each of the sentences within the embedding space. ```{r eval = FALSE} # get the pairwise distances between each embedded sentence e_dist <- CalcHellingerDist(e) ``` Since we are using a distance measure whose values fall between $0$ and $1$, we can take $1 - distance$ to get a similarity. We'll also re-scale it to be between 0 and 100. (The rescaling is just a cautionary measure so that we don't run into numerical precision issues when performing calculations downstream.) ```{r eval = FALSE} # turn into a similarity matrix g <- (1 - e_dist) * 100 ``` If you consider a similarity matrix to be an adjacency matrix, then you have a fully-connected graph. For the sake of potentially faster computation and with the hope of eliminating some noise, we will delete some edges. Going row-by-row, we will keep connections only to the top 3 most similar sentences. ```{r eval = FALSE} # we don't need sentences connected to themselves diag(g) <- 0 # turn into a nearest-neighbor graph g <- apply(g, 1, function(x){ x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0 x }) # by taking pointwise max, we'll make the matrix symmetric again g <- pmax(g, t(g)) ``` Using the `igraph` package (with its own objects) to calculate eigenvector centrality. From there, we'll take the top three sentences. ```{r eval = FALSE} g <- igraph::graph.adjacency(g, mode = "undirected", weighted = TRUE) # calculate eigenvector centrality ev <- igraph::evcent(g) # format the result result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ] result <- result[ order(as.numeric(names(result))) ] paste(result, collapse = " ") ``` # Pulling it all together The code below puts it all together in a single function. The first few lines vectorize the code, so that we can summarize multiple documents from a single function call. ```{r eval = need("igraph"), summaries} # let's do this in a function summarizer <- function(doc, gamma) { # recursive fanciness to handle multiple docs at once if (length(doc) > 1 ) # use a try statement to catch any weirdness that may arise return(sapply(doc, function(d) try(summarizer(d, gamma)))) # parse it into sentences sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]] names(sent) <- seq_along(sent) # so we know index and order # embed the sentences in the model e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 1) # remove any documents with 2 or fewer words e <- e[ rowSums(e) > 2 , ] vocab <- intersect(colnames(e), colnames(gamma)) e <- e / rowSums(e) e <- e[ , vocab ] %*% t(gamma[ , vocab ]) e <- as.matrix(e) # get the pairwise distances between each embedded sentence e_dist <- CalcHellingerDist(e) # turn into a similarity matrix g <- (1 - e_dist) * 100 # we don't need sentences connected to themselves diag(g) <- 0 # turn into a nearest-neighbor graph g <- apply(g, 1, function(x){ x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0 x }) # by taking pointwise max, we'll make the matrix symmetric again g <- pmax(g, t(g)) g <- igraph::graph.adjacency(g, mode = "undirected", weighted = TRUE) # calculate eigenvector centrality ev <- igraph::evcent(g) # format the result result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ] result <- result[ order(as.numeric(names(result))) ] paste(result, collapse = " ") } ``` How well did we do? Let's look at summaries from the first three reviews. ```{r eval = need("igraph")} # Let's see the summary of the first couple of reviews docs <- movie_review$review[ 1:3 ] names(docs) <- movie_review$id[ 1:3 ] sums <- summarizer(docs, gamma = embeddings$gamma) sums ``` Compare that to the whole reviews yourself.