---
title: "Parsing Log Files with Tabulog"
author: "Austin Nar"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Parsing Log Files with Tabulog}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tabulog)
library(readr)
library(lubridate)
```

## Introduction to Tabulog

Tabulog is a flexible, powerful framework for parsing log files, specifically
designed for web logs (such as the access.log files created by Apache), with the
final output being in a tabular format.

Parsing logs with Tabulog requires two things: a template, and a list of
"parser classes."

### Tabulog Templates

Inspired by Python's [Jinja2 templates](http://jinja.pocoo.org/), Tabulog templates
use a human-readable format mixing literal text with code. Code is being used
extremely loosely here, as you will see that the 'code' in our templates
is not actually R code. 

The easiest place to start is with an example. Let's say you have a simple log 
file that looks like this:

```
10.0.0.8 - - [2019-01-01:10:58:12 -500] "https://mysite.com/index.html"
173.28.102.33 - - [2019-01-01:10:58:25 -500] "https://mysite.com/login"
...
```

We can see the log file here holds a certain format, specifically:

```
<ip address> - - [<datetime>] "<url>"
```

The Tabulog template to parse such a file looks like this

```
{{ ip ip_address }} - - [{{ Date date_time }}] "{{ url URL }}"
```

Each set of curly brackets represents an instance of a *class*, and is declared 
in the C style of `class var_name`. So in the template above, `{{ ip ip_address }}`
is really saying "In this spot, look for an ip, and call it `ip_address`."

You may ask, how does the Tabulog know what an ip address *is*? Which is where
we are introduced to *parser classes*.

### Parser Classes
In order to know what to look for in each field of our template, Tabulog must 
know what a given class should look like. For this we give it a *parser class*,
which is really just a wrapper object for a regular expression.

In the current example with the ip address, we would tell Tabulog that the
ip class is represented by the Perl regular expression: `[0-9]{1,3}(\.[0-9]{1,3}){3}`.
When Tabulog parsed the log file, it would look for a match on that expression
in that spot, and raise a warning if it didn't find one.

#### Parser Formatters
Once a field is parsed and read into R, you may want to further transform or 
*format* the text. For example, you may want to cast an integer field using
the `as.integer` function. This is achieved using formatters.

When a parser object is created, an optional formatter can be passed. This is
simply a function that takes one argument (a character vector) and returns
a vector of the same length in the desired format. For example, the builtin 
`int` parser is created by the following call:

```{r}
parser('[0-9]+', f = as.integer, name = 'int') # Name is optional
```

Tabulog as a framework is designed to be language-agnostic, so the ideas of
templates and parser classes here will be portable to any other versions
of the package made for other languages. Formatters, however, are language
specific and must be implemented in the language being used. 


## An Example in R

Let's again say you have the example logs in the file `accesslog.txt`.

```{r show_logs, comment=''}
log_file <- 'accesslog.txt'
cat(readr::read_file(log_file))
```

We first define the template as before.

```{r build_template}
template <- '{{ ip ip_address }} - - [{{ date date_time }}] "{{ url URL }}"'
```

We then need to define our classes. `ip` and `url` are builtins with the package,
but dates come in a variety of formats so we must explicitly define ours here.
Note you can see all builtins using `default_classes()`

```{r date_class}
date_parser <- parser(
  '[0-9]{4}\\-[0-9]{2}\\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\\-\\+][0-9]{4}',
  function(x) lubridate::as_datetime(x, format = '%Y-%m-%d:%H:%M:%S %z'),
  name = 'date'
)
date_parser

default_classes()[c('ip', 'url')]
```

Both `ip` and `url` require no formatting, so they have the identity function, 
(`(` in R), as their formatter.

To get our final output in tabular format, we simply make the follow call to 
`parse_logs`.

```{r parse_logs}
# Naming the date_parser 'date' in the list tells Tabulog to use it to parse
# the field with class 'date' in the template.
parse_logs(readLines(log_file), template, classes = list(date = date_parser))
```

Note that we only had to pass our custom class `date`. The builtin classes `ip`
and `url` were included by default.

A more elegant and portable way of completing this task would be to define the
template and the custom class in the same file, which can be ported to other
Tabulog libraries in other languages, leaving only the formatters to be 
defined in the R script. 

First, we define the `template` and the `classes` in a yaml file

```{r yaml_template, comment=''}
template_file <- 'accesslog_template.yml'
cat(readr::read_file(template_file))
```

Next, we define the formatters for each of our classes. Here we only have one, 
but we still put it in a named list, with the name matching the name of the
class in the template file.

```{r custom_formatters}
formatters <- list(
  date = function(x) lubridate::as_datetime(x, format = '%Y-%m-%d:%H:%M:%S %z')
)
```

Finally, we make one call to `parse_logs_file`.

```{r parse_logs_file}
parse_logs_file(log_file, template_file, formatters)
```

## Notes

### Escape characters
The only characters that need to be escaped in templates are curly braces 
(even single ones). Usually a backslash should be sufficient `'\{'`, but the 
html-style escapes `'&#123;'` and `'&#125;'` are also included as valid syntax
for any edge cases that may arise.
