---
title: fastTextR - Text Classification
output: html_document
vignette: >
  %\VignetteIndexEntry{Text_Classification}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

**fastTextR** is an **R** interface to the [fastText](https://github.com/facebookresearch/fastText)
library. It can be used for **word representation learning** *(Bojanowski et al., 2016)* and
**supervised text classification** *(Joulin et al., 2016)*.
A particular advantage of **fastText** over other software is that
it was designed for fairly large data sets.

The following example is based on the examples provided in the **fastText** library
and shows how to use **fastTextR** for text classification.

## Download Data    

```{r, eval=FALSE}
options(width=100L)
fn <- "dbpedia_csv.tar.gz"

if ( !file.exists(fn) ) {
    download.file("https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz",
                  fn)
    untar(fn)
}
```

## Normalize Data
In **fastText**, labels are typically marked with `__label__1` to `__label__k`.
Since training in **fastText** is sensitive to the order of the training data, it is important
to ensure that the order of the training data follows no particular pattern
(which is done here with `sample`). The function `ft_normalize` mimics
the data preparation steps of the bash function `normalize_text`
as shown in
[classification-example.sh](https://github.com/facebookresearch/fastText/blob/master/classification-example.sh).


```{r, eval=FALSE}
library("fastTextR")

train <- sample(sprintf("__label__%s", readLines("dbpedia_csv/train.csv")))
head(train, 2)
```

```
## [1] "__label__5,\"Helmut Haussmann\",\" Helmut Haussmann (born 18 May 1943) is a German academic and politician. He served as minister of economy from 1988 to 1991.\""                                                                                                                                                                              
## [2] "__label__9,\"Studzianki Rawa County\",\" Studzianki [stuˈd͡ʑaŋki] is a village in the administrative district of Gmina Sadkowice within Rawa County Łódź Voivodeship in central Poland. It lies approximately 5 kilometres (3 mi) west of Sadkowice 16 km (10 mi) east of Rawa Mazowiecka and 69 km (43 mi) east of the regional capital Łódź.\""
```
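
The label prefixing and shuffling used above can be tried out on a few made-up lines without downloading anything. The following is a minimal base **R** sketch; the vector `lines` is hypothetical toy data standing in for the contents of `dbpedia_csv/train.csv`, and `set.seed` is added only to make the shuffle reproducible.

```r
# Hypothetical toy lines in the same "label,title,text" layout as train.csv
lines <- c('1,"Title A"," Text about a company."',
           '2,"Title B"," Text about a school."',
           '1,"Title C"," More text about a company."')

# Prefix every line with the fastText label marker
labelled <- sprintf("__label__%s", lines)

# Shuffle the lines so the training file follows no systematic order
set.seed(1)
shuffled <- sample(labelled)

# Recover the label of each line by dropping everything from the first comma on
toy_labels <- sub(",.*", "", shuffled)
print(toy_labels)
```

Shuffling only reorders the lines, so the set of lines and the label counts are unchanged.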

```{r, eval=FALSE}
train <- ft_normalize(train)
writeLines(train, con = "dbpedia.train")

test <- readLines("dbpedia_csv/test.csv")
labels <- trimws(gsub(",.*", "", test))
table(labels)
```

```
## labels
##    1   10   11   12   13   14    2    3    4    5    6    7    8    9 
## 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000
```
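
Because `labels` is a character vector, `table` lists the classes in character order (`1`, `10`, `11`, ...). If a numeric order is preferred, one possible approach is to convert the labels to a factor with numerically sorted levels. The sketch below uses a small hypothetical vector `toy_labels` rather than the DBpedia labels.

```r
# Hypothetical label strings, as they would come out of the gsub() call above
toy_labels <- c("1", "10", "2", "10", "1")

# Character order: "1", "10", "2"
tab_chr <- table(toy_labels)

# Numeric order: build the factor levels from the sorted integer values
lvls <- sort(unique(as.integer(toy_labels)))
tab_num <- table(factor(toy_labels, levels = lvls))
print(tab_num)
```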

```{r, eval=FALSE}
test <- ft_normalize(test)
test <- trimws(sub(".*?,", "", test))
head(test, 2)
```

```
## [1] "\" TY KU \" , \" TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits . The privately-held company was founded in 2004 and is headquartered in New York City New York . While based in New York TY KU ' s beverages are made in Japan through a joint venture with two sake breweries . Since 2011 TY KU ' s growth has extended its products into all 50 states . \""                                        
## [2] "\" Odd Lot Entertainment \" , \" OddLot Entertainment founded in 2001 by longtime producers Gigi Pritzker and Deborah Del Prete ( The Wedding Planner ) is a film production and financing company based in Culver City California . OddLot produced the film version of Orson Scott Card ' s sci-fi novel Ender ' s Game . A film version of this novel had been in the works in one form or another for more than a decade by the time of its release . \""
```

## Train Model
Once the data is prepared, the model can be trained and saved to
the file `"dbpedia.bin"`.

```{r, eval=FALSE}
cntrl <- ft_control(word_vec_size = 10L, learning_rate = 0.1, max_len_ngram = 2L, 
                    min_count = 1L, nbuckets = 10000000L, epoch = 5L, nthreads = 4L)

model <- ft_train(file = "dbpedia.train", method = "supervised", control = cntrl)
ft_save(model, "dbpedia.bin")
```

## Read Model
A previously trained model can be loaded via the function `ft_load`.

```{r, eval=FALSE}
model <- ft_load("dbpedia.bin")
```

## Predict / Test Model
To perform prediction, the function `ft_predict` can be used.

```{r, eval=FALSE}
test_pred <- ft_predict(model, newdata=test, k = 1L)
str(test_pred)
```

```
## 'data.frame':	70000 obs. of  3 variables:
##  $ id   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ label: chr  "__label__1" "__label__1" "__label__1" "__label__1" ...
##  $ prob : num  1 0.72 0.999 0.998 0.993 ...
```

```{r, eval=FALSE}
confusion_matrix <- table(truth=as.integer(labels), 
                          predicted=as.integer(gsub("\\D", "", test_pred$label)))
print(confusion_matrix)
```

```
##      predicted
## truth    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##    1  4734   45   13    6   12   50   60    1    2    4    6   12    7   48
##    2    41 4912    1    1    2    0   32    3    1    0    1    0    0    6
##    3    16    2 4817   15   74    0    5    1    0    0    0   23   11   36
##    4     2    1   29 4947   15    3    0    0    0    2    0    0    1    0
##    5     7    5   70   11 4896    3    3    0    1    1    0    0    0    3
##    6    34    1    1    1    3 4936   12    5    0    0    0    2    3    2
##    7    59   31    1    1    6   17 4839   26    8    0    0    1    1   10
##    8     3    1    0    0    1    2   28 4944   16    2    2    0    0    1
##    9     1    1    0    0    2    0   10   17 4967    0    1    1    0    0
##    10    3    0    0    1    0    0    0    5    0 4952   37    1    0    1
##    11   17    1    0    0    0    1    0    2    0   32 4945    1    0    1
##    12    7    0   18    1    0    4    0    0    0    0    0 4937   21   12
##    13    7    1    8    0    0    2    3    1    0    0    0   18 4926   34
##    14   44    7   25    1    2    5    7    3    1    2    1    5   34 4863
```

```{r, eval=FALSE}
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(sprintf("Accuracy: %0.4f", accuracy))
```

```
## [1] "Accuracy: 0.9802"
```
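
Overall accuracy hides differences between classes. Per-class recall and precision can be read off the same kind of confusion matrix with `diag`, `rowSums` and `colSums`. The following is a minimal sketch on a hypothetical 3-class matrix `cm` (rows are the truth, columns the predictions, as in `confusion_matrix` above); the counts are made up for illustration.

```r
# Hypothetical 3-class confusion matrix: rows = truth, columns = predicted
cm <- matrix(c(8, 1, 1,
               2, 7, 1,
               0, 1, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(truth = 1:3, predicted = 1:3))

# Recall: share of each true class that was predicted correctly
recall <- diag(cm) / rowSums(cm)

# Precision: share of each predicted class that was actually correct
precision <- diag(cm) / colSums(cm)

# Overall accuracy, computed as in the chunk above
toy_accuracy <- sum(diag(cm)) / sum(cm)

print(round(cbind(recall, precision), 4))
```

The same three lines should apply unchanged to `confusion_matrix`, since it has the same layout.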


## References

[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)

```
@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
```

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
