---
title: "Using the tm.plugin.koRpus Package for Corpus Analysis"
author: "m.eik michalke"
date: "`r Sys.Date()`"
output: 
  html_document:
    theme: cerulean
    highlight: kate
    toc: true
    toc_float: 
      collapsed: false
      smooth_scroll: false
    toc_depth: 3
    includes: 
      in_header: vignette_header.html
abstract: >
  The R package `tm.plugin.koRpus` is an extension to the `koRpus` package, enhancing its usability for actual corpus analysis. It adds object classes and methods, inheriting from those provided by `koRpus`,
  which are designed to work with complete text corpora in both `koRpus` and `tm` formats. This vignette gives you a quick overview.
vignette: >
  %\VignetteIndexEntry{Using the tm.plugin.koRpus Package for Text Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8x]{inputenc}
  \usepackage{lmodern}
  % \usepackage[apaciteclassic]{apacite}
---

```{r setup, include=FALSE}
header_con <- file("vignette_header.html")
writeLines('<meta name="flattr:id" content="4zdzgd" />', header_con)
close(header_con)
```
<!--
```{r, include=FALSE, cache=FALSE}
# load knitr for better graphics support
library(knitr)
```
-->

# What is tm.plugin.koRpus?
While the `koRpus` package focusses mostly on analysis steps of individual texts, `tm.plugin.koRpus` adds a new object class and respective methods, which can be used
to analyse complete text corpora in a single step. The object class can also be a first step to building a bridge between the `koRpus` and `tm` packages.

At the core of this package there is one particular object class -- `kRp.corpus` -- which can be used to construct simple corpus objects or even hierarchically nested corpora.
That is, you are able to categorize corpora on as many levels as you need. The examples in this vignette use two levels, one being different *topics* the texts
in the sample corpus deal with, and the other different *sources* the texts come from.

If you don't need these hierarchical levels, you can just use the method `readCorpus()` to create a corpus object, i.e., a simple collection of texts.
To distinguish texts which came from different sources or deal with different topics, use the `hierarchy` argument, which will add categorial columns to
the tagged text objects. These objects will only be valid if there are texts of each topic from each source.

Now, if this still confuses you, let's look at a small example.

# Tokenizing corpora

As with `koRpus`, the first step for text analysis is tokenizing and possibly POS tagging. This step is performed by the `readCorpus()` method mentioned above.
The package includes four sample texts taken from Wikipedia^[See the file `tests/testthat/samples/License\_of\_sample\_texts.txt` for details] in its `tests` directory which
we can use for a demonstration:

```{r, eval=FALSE}
library(tm.plugin.koRpus)
library(koRpus.lang.de)
# set the root path to the sample files
sampleRoot <- file.path(path.package("tm.plugin.koRpus"), "tests", "testthat", "samples")
# the next call uses "hierarchy" to describe the directory structure
# and its meaning; see below
sampleTexts <- readCorpus(
  dir=sampleRoot,
  hierarchy=list(
    Topic=c(
      C3S="C3S SCE",
      GEMA="GEMA e.V."
    ),
    Source=c(
      Wikipedia_alt="Wikipedia (alt)",
      Wikipedia_neu="Wikipedia (neu)"
    )
  ),
  tagger="tokenize",
  lang="de"
)
```
```
Processing corpus...
  Topic "C3S SCE", 2 texts...
    Source "Wikipedia (alt)", 1 text...
    Source "Wikipedia (neu)", 1 text...
  Topic "GEMA e.V.", 2 texts...
    Source "Wikipedia (alt)", 1 text...
    Source "Wikipedia (neu)", 1 text...
```

## The `hierarchy` argument

The `hierarchy` argument describes our corpus in a very condensed format. It is a named list of named character vectors,
where each list entry represents a hierarchical level. In this case, the top level is called *"Topics"*, below that is the
level *"Source"*. These hierachical levels must also be represented by the directory structure of the texts to parse,
and the *names* of the character vectors must be identical to the *directory names* below the root directory specified by `dir`.^[
Future versions of this package might add furter ways of describing your corpus, like using a configuration file or providing
a full corpus in XML or JSON format. But don't hold your breath.]

So on your file system, what the `hierarchy` argument above describes is the following layout:

```
.../samples/
  C3S/
    Wikipedia_alt/
      Text01.txt
      Text02.txt
      ...
    Wikipedia_neu/
      Text03.txt
      Text04.txt
      ...
  GEMA/
    Wikipedia_alt/
      Text05.txt
      Text06.txt
      ...
    Wikipedia_neu/
      Text07.txt
      Text08.txt
      ...
```

Since we're using the `koRpus` package for all actual analysis, you can also setup your environment with `set.kRp.env()` and POS-tag all texts with `TreeTagger`^[see the `koRpus`
documentation for details.].

# Analysing corpora

After the initial tokenizing, we can analyse the corpus by calling the provided methods, for instance lexical diversity:

```{r, eval=FALSE}
sampleTexts <- lex.div(sampleTexts)
corpusSummary(sampleTexts)
```
```
                                                                       doc_id     Topic
C3S-Wikipedia_alt-C3S_2013-09-24.txt     C3S-Wikipedia_alt-C3S_2013-09-24.txt   C3S SCE
C3S-Wikipedia_neu-C3S_2015-07-05.txt     C3S-Wikipedia_neu-C3S_2015-07-05.txt   C3S SCE
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt GEMA-Wikipedia_alt-GEMA_2013-09-26.txt GEMA e.V.
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt GEMA-Wikipedia_neu-GEMA_2015-07-05.txt GEMA e.V.
                                                Source    a    C CTTR   HDD     K lgV0 MATTR MSTTR
C3S-Wikipedia_alt-C3S_2013-09-24.txt   Wikipedia (alt) 0.16 0.95 6.13 38.14 49.92 6.21  0.81  0.79
C3S-Wikipedia_neu-C3S_2015-07-05.txt   Wikipedia (neu) 0.17 0.94 6.82 38.05 54.88 6.10  0.82  0.76
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt Wikipedia (alt) 0.17 0.94 7.07 37.61 65.08 6.11  0.80  0.78
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt Wikipedia (neu) 0.16 0.94 7.13 37.87 60.14 6.24  0.81  0.79
                                         MTLD MTLDMA     R    S  TTR     U
C3S-Wikipedia_alt-C3S_2013-09-24.txt   100.16     NA  8.68 0.93 0.78 39.92
C3S-Wikipedia_neu-C3S_2015-07-05.txt   123.01     NA  9.65 0.92 0.73 36.46
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt 106.94    192 10.00 0.92 0.71 35.96
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt 111.64     NA 10.08 0.92 0.73 37.47
```

As you can see, `corpusSummary()` returns a `data.frame` object with the summarised results of all
texts. Here's an example how to use this to plot interactions:

```{r, eval=FALSE}
library(sciplot)
lineplot.CI(
  x.factor=corpusSummary(sampleTexts)[["Source"]],
  response=corpusSummary(sampleTexts)[["MTLD"]],
  group=corpusSummary(sampleTexts)[["Topic"]],
  type="l",
  main="MTLD",
  xlab="Media source",
  ylab="Lexical diversity score",
  col=c("grey", "black"),
  lwd=2
)
```
<figure>
  <img src="lineplot.jpg" />
  <figcaption>Well, the example texts aren't so impressive here, as there's not much variance in one text per source and topic.</figcaption>
</figure>

There are quite a number of `corpus*()` getter/setter methods for slots of these objects, e.g.,
`corpusReadability()` to get the `readability()` results from objects of class `kRp.corpus`.

The S4 object class provided by `tm.plugin.koRpus` directly inherits its structure from `kRp.text` of the `koRpus` package,
adding additional slots for meta information and `Corpus` objects of the `tm` package for raw data.

Two methods can be especially helpful for further analysis. The first one is `tif_as_tokens_df()` and returns
a `data.frame` including all texts of the tokenized corpus in a format that is compatible with
[Text Interchange Formats](https://github.com/ropensci/tif) standards.

The second one is a family of `[`, `[<-`, `[[` and `[[<-` shorcuts to directly interact with the
`data.frame` object you would get via `taggedText()`.

## Frequency analysis

The object class makes it quite comfortable to analyse type frequencies of corpora. There is a method
`read.corp.custom()` for these classes, that will do this analysis recursively on all levels:

```{r, eval=FALSE}
sampleTexts <- read.corp.custom(sampleTexts, case.sens=FALSE)
sampleTextsWordFreq <- query(
  corpusCorpFreq(sampleTexts),
  var="wclass",
  query=kRp.POS.tags(lang="de", list.classes=TRUE, tags="words")
)
head(sampleTextsWordFreq, 10)
```
```
   num    word lemma      tag wclass lttr freq         pct  pmio    log10 rank.avg rank.min
3    3     die       word.kRp   word    3   30 0.037220844 37220 4.570776    263.0      263
4    4     der       word.kRp   word    3   21 0.026054591 26054 4.415874    262.0      262
5    5    gema       word.kRp   word    4   17 0.021091811 21091 4.324097    260.5      260
6    6     und       word.kRp   word    3   17 0.021091811 21091 4.324097    260.5      260
7    7   einer       word.kRp   word    5   12 0.014888337 14888 4.172836    258.5      258
8    8     von       word.kRp   word    3   12 0.014888337 14888 4.172836    258.5      258
11  11     ist       word.kRp   word    3   10 0.012406948 12406 4.093632    256.0      255
12  12     bei       word.kRp   word    3    9 0.011166253 11166 4.047898    254.0      254
13  13     das       word.kRp   word    3    8 0.009925558  9925 3.996731    252.5      252
14  14 urheber       word.kRp   word    7    8 0.009925558  9925 3.996731    252.5      252
   rank.rel.avg rank.rel.min inDocs     idf
3      99.24528     99.24528      4 0.00000
4      98.86792     98.86792      4 0.00000
5      98.30189     98.11321      4 0.00000
6      98.30189     98.11321      4 0.00000
7      97.54717     97.35849      4 0.00000
8      97.54717     97.35849      4 0.00000
11     96.60377     96.22642      4 0.00000
12     95.84906     95.84906      4 0.00000
13     95.28302     95.09434      4 0.00000
14     95.28302     95.09434      2 0.30103
```

In combination with the `wordcloud` package, this can directly be used to plot
word clouds:

```{r, eval=FALSE}
require(wordcloud)
colors <- brewer.pal(8, "RdGy")
wordcloud(
  head(sampleTextsWordFreq[["word"]], 200),
  head(sampleTextsWordFreq[["freq"]], 200),
  random.color=TRUE,
  colors=colors
)
```
<figure>
  <img src="wordcloud.jpg" />
  <figcaption>The 200 most frequent words in the example corpus</figcaption>
</figure>

<!--  # Reference -->
