---
title: "bib2df - Parse a BibTeX file to a data.frame"
author: "Philipp Ottolinger"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{bib2df}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE
)
```
 
## BibTeX

BibTeX is typically used together with LaTeX to manage references. The BibTeX file format follows a few simple but strict rules and represents a reference's core data as a list of fields, some of which are mandatory.

The resulting BibTeX file can tell a lot about the work it is used for, be it an academic paper, a dissertation or any other report that at least partly refers to the work of others. The average age of the referenced works might indicate whether the author addresses a current topic or digs into the history of a field. Does the author cite many works by just a few authors, or does every author appear at most once? The BibTeX file can answer these questions.

## Why bib2df?

As mentioned above, BibTeX represents entries as lists of named fields, somewhat similar to the `JSON` format. If you want to gain insights from your BibTeX file using R, you have to transform the data into a more common data structure. In R, such a data structure is the `tibble`. `bib2df` does exactly this: it takes a BibTeX file and parses it into a `tibble`, so you can work with your bibliographic data just the way you do with other data.

Given this `tibble` you can manipulate entries in a familiar way and write the updated references back to a valid BibTeX file. 

## How to use

### Parse the BibTeX file

To parse a BibTeX file into a `tibble`, use the function `bib2df()`. Its first argument is the path to the file you want to parse. If the package is not installed yet, install it first:

```{r, eval = FALSE}
install.packages("bib2df")
```

```{r}
library(bib2df)

path <- system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df")

df <- bib2df(path)
df
```

`bib2df()` returns a `tibble` in which each row represents one entry of the BibTeX file and each column holds the data originally stored in the corresponding named field. If a field is not present in a particular entry, the respective column holds `NA` for that row. Because a work can have multiple authors or editors, these fields are stored as list columns, which avoids concatenating the names of several persons into a single character string:

```{r}
head(df$AUTHOR)
```
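
Because absent fields are stored as `NA`, the share of missing values per column gives a quick overview of which fields a file actually uses. The following is a minimal sketch of that idea; the object names `bib_path`, `refs` and `missing_share` are introduced here for illustration only.

```{r}
library(bib2df)

bib_path <- system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df")
refs <- bib2df(bib_path)

# Share of entries in which each field is missing (NA).
# is.na() also accepts list columns such as AUTHOR, where it flags
# elements that consist of a single NA.
missing_share <- vapply(refs, function(col) mean(is.na(col)), numeric(1))

# Fields that occur in at least one entry, most complete first
sort(missing_share[missing_share < 1])
```

Columns with a share of 1 never occur in the file and can usually be ignored in later analyses.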

The second argument of `bib2df()` is `separate_names`. If it is `TRUE`, `bib2df()` uses the `humaniformat` [^1] package to automatically split each person's name into its parts:

```{r}
df <- bib2df(path, separate_names = TRUE)
head(df$AUTHOR)
```
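
With separated names, individual name parts can be pulled out of the list column. The sketch below collects the last name of the first author of every entry. It assumes that each element of `AUTHOR` is a data frame with a `last_name` column, as produced by `humaniformat`, and falls back to `NA` for anything else; `refs_named` and `first_author` are illustrative names.

```{r}
library(bib2df)

bib_path <- system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df")
refs_named <- bib2df(bib_path, separate_names = TRUE)

# Last name of the first listed author per entry.
# Elements that are not a non-empty data frame (assumed to mean
# "no author given") yield NA.
first_author <- vapply(
  refs_named$AUTHOR,
  function(a) {
    if (is.data.frame(a) && nrow(a) > 0) as.character(a$last_name[1]) else NA_character_
  },
  character(1)
)

head(first_author)
```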

### Parsing multiple BibTeX files

Multiple BibTeX files can be parsed using `lapply()`. The paths to the BibTeX files must be stored in a vector. Calling `bib2df()` on this vector within `lapply()` results in a list of `tibble`s, which can be combined into one, e.g. using `dplyr::bind_rows()`:

```{r}
bib1 <- system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df")
bib2 <- system.file("extdata", "r.bib", package = "bib2df")
paths <- c(bib1, bib2)

x <- lapply(paths, bib2df)
class(x)
head(x)

res <- dplyr::bind_rows(x)
class(res)
head(res)
```
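
After binding, it is no longer visible which file an entry came from. One possible way to keep that information, sketched here, is to name the list elements and pass `.id` to `bind_rows()`. The column name `SOURCE_FILE` and the object names `bib_paths`, `parsed` and `combined` are chosen for this example and are not part of `bib2df`.

```{r}
library(bib2df)
library(dplyr)

bib_paths <- c(
  system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df"),
  system.file("extdata", "r.bib", package = "bib2df")
)

parsed <- lapply(bib_paths, bib2df)

# Name each tibble after its file so that bind_rows() can record the origin
names(parsed) <- basename(bib_paths)
combined <- bind_rows(parsed, .id = "SOURCE_FILE")

# Number of entries contributed by each file
count(combined, SOURCE_FILE)
```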


## Analyze and visualize your references

Since the BibTeX entries are now converted to rows and columns in a `tibble`, one can start to analyze and visualize the data with common tools like `ggplot2`, `dplyr` and `tidyr`.

For example, one can ask which journals are cited most often among the references

```{r, message=FALSE}
library(dplyr)
library(ggplot2)
library(tidyr)

df %>%
  filter(!is.na(JOURNAL)) %>%
  group_by(JOURNAL) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  slice(1:3)
```

or what the median age of the cited works is:

```{r}
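# Age of each work relative to the year in which the vignette is built
current_year <- as.integer(format(Sys.Date(), "%Y"))

df %>%
  mutate(age = current_year - YEAR) %>%
  summarize(m = median(age, na.rm = TRUE))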
```

Plotting is possible as well:

```{r, fig.height = 5, fig.width = 7}
df %>% 
  select(YEAR, AUTHOR) %>% 
  unnest(cols = AUTHOR) %>% 
  ggplot() + 
  aes(x = YEAR, y = reorder(full_name, desc(YEAR))) + 
  geom_point()
```
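
The question raised in the introduction, whether many of the references go back to just a few authors, can be approached in the same way. The following sketch unnests the `AUTHOR` column and counts how often each name occurs. It parses the file again into `refs_named` so that it can be run on its own; `author_counts` is an illustrative name.

```{r}
library(bib2df)
library(dplyr)
library(tidyr)

bib_path <- system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df")
refs_named <- bib2df(bib_path, separate_names = TRUE)

# One row per (entry, author) pair, then count occurrences of each name
author_counts <- refs_named %>%
  select(AUTHOR) %>%
  unnest(cols = AUTHOR) %>%
  filter(!is.na(full_name)) %>%
  count(full_name, sort = TRUE)

head(author_counts)
```

Note that the counts depend on names being spelled consistently; 'E. Dekel' and 'Eddie Dekel' would be counted as two different authors.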


## Manipulate your references

Since every BibTeX entry is represented by a row in a `tibble`, the entries can be manipulated with the usual R tools.

One of the authors of the 10th reference in our file is listed with only an initial instead of a full first name:

```{r}
df$AUTHOR[[10]]
```

The 'E.' in 'E. Dekel' stands for Eddie, so let's change the value of that field:

```{r}
df$AUTHOR[[10]]$first_name[2] <- "Eddie"
df$AUTHOR[[10]]$full_name[2] <- "Eddie Dekel"

df$AUTHOR[[10]]
```
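
Fields that are stored as ordinary character columns can be changed for all entries at once with vectorised functions. As an example, the sketch below rewrites page ranges so that they use the double hyphen common in BibTeX files. It assumes that `PAGES` holds ranges such as `46 - 61` and works on a separate object, `refs`, so the `tibble` used above stays untouched.

```{r}
library(bib2df)

bib_path <- system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df")
refs <- bib2df(bib_path)

# Replace any run of hyphens, with optional surrounding whitespace,
# by a double hyphen; missing values stay NA
refs$PAGES <- gsub("\\s*-+\\s*", "--", refs$PAGES)

head(refs$PAGES[!is.na(refs$PAGES)])
```

The pattern is deliberately simple and would also alter hyphens in non-numeric page labels, so it is worth checking the result before writing it back.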


## Write back to BibTeX

Especially after changing individual values in the parsed data, it is useful to write the `tibble` back to a valid BibTeX file that can be used with LaTeX. Just as `bib2df()` parses a BibTeX file, `df2bib()` writes one:

```{r,eval=FALSE}
newFile <- tempfile()
df2bib(df, file = newFile)
```

The newly written BibTeX file contains the values we just changed in the `tibble`:

```
@Incollection{BrandenburgerDekel1989,
  Address = {New York},
  Author = {Brandenburger, Adam and Dekel, Eddie},
  Booktitle = {The Economics of Missing Markets, Information and Games},
  Chapter = {3},
  Pages = {46 - 61},
  Publisher = {Oxford University Press},
  Title = {The Role of Common Knowledge Assumptions in Game Theory},
  Year = {1989}
}
```

To append BibTeX entries to an existing file, use `append = TRUE` in `df2bib()`.
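
As a rough illustration of appending, the following sketch writes two entries to a temporary file, appends two more, and parses the result again to see which keys ended up in the file. The object names `refs`, `out_file` and `written` are made up for this example, and the row ranges are arbitrary.

```{r}
library(bib2df)

bib_path <- system.file("extdata", "LiteratureOnCommonKnowledgeInGameTheory.bib", package = "bib2df")
refs <- bib2df(bib_path)

out_file <- tempfile(fileext = ".bib")

# Write the first two entries, then append the next two to the same file
df2bib(refs[1:2, ], file = out_file)
df2bib(refs[3:4, ], file = out_file, append = TRUE)

# Parse the written file again to inspect its content
written <- bib2df(out_file)
written$BIBTEXKEY
```

Without `append = TRUE`, the second call would be expected to overwrite the file rather than extend it.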

[^1]: <https://CRAN.R-project.org/package=humaniformat>