---
title: "Getting started"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  progress = FALSE,
  error = FALSE, 
  message = FALSE
)

options(digits = 2)
```

## What is the RAKE algorithm?

The Rapid Automatic Keyword Extraction (RAKE) algorithm was first described in [Rose et al.](http://media.wiley.com/product_data/excerpt/22/04707498/0470749822.pdf) as a way to quickly extract keywords from documents. The algorithm invovles two main steps:

**1. Identify candidate keywords.** A candidate keyword is any set of contiguous words (i.e., any n-gram) that doesn't contain a phrase delimiter or stop word.[^1] A phrase delimiter is a punctuation character that marks the beginning or end of a phrase (e.g., a period or comma). Splitting up text based on phrase delimiters/stop words is the essential idea behind RAKE. According to the authors:

> RAKE is based on our observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words *and*, *the*, and *of*, or other words with minimal lexical meaning

In addition to using stop words and phrase delimiters to identify the candidate keywords, you can also use a word's part-of-speech (POS). For example, most keywords don't contain verbs, so you may want treat verbs as if they were stop words. You can use `slowrake()`'s `stop_pos` parameter to choose which parts-of-speech to exclude from your candidate keywords.

**2. Calculate each keyword's score.** A keyword's score (i.e., its degree of "keywordness") is the sum of its *member word* scores. For example, the score for the keyword "dog leash" is calculated by adding the score for the word "dog" with the score for the word "leash." A member word's score is equal to its degree/frequency, where degree equals the number of times that the word co-occurs with other words (in other keywords), and frequency is the total number of times that the word occurs overall.

See [Rose et al.](http://media.wiley.com/product_data/excerpt/22/04707498/0470749822.pdf) for more details on how RAKE works. 

## Examples

RAKE is unique in that it is completely unsupervised (i.e., no training data is required), so it's relatively easy to use. Let's take a look at a few basic examples that demonstrate `slowrake()`'s various parameters.

```{r}
library(slowraker)

txt <- "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."
```

* Use the default settings:
```{r}
slowrake(txt = txt)[[1]]
```

* Don't stem keywords before scoring them: 
```{r}
slowrake(txt = txt, stem = FALSE)[[1]]
```

* Add the word "diophantine" to the default set of stop words (the default set of stop words is `slowraker::smart_words`):
```{r}
slowrake(txt = txt, stop_words = c(smart_words, "diophantine"))[[1]]
```

* Don't use a word's part-of-speech to determine if it's a stop word:
```{r}
slowrake(txt = txt, stop_pos = NULL)[[1]]
```

* Consider words that aren't nouns to be stop words:
```{r}
slowrake(txt = txt, stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)])[[1]]
```

* List the keywords that occur most frequently (`freq`):
```{r}
res <- slowrake(txt = txt)[[1]]
res2 <- aggregate(freq ~ keyword + stem, data = res, FUN = sum)
res2[order(res2$freq, decreasing = TRUE), ]
```

* Run RAKE on a vector of documents instead of just one document:
```{r}
slowrake(txt = dog_pubs$abstract[1:10])
```
[^1]: Technically, the original version of RAKE allows some keywords to contain stop words, but `slowrake()` does not allow for this.