---
title: "Text Alignment"
author: "Jan Wijffels"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: false
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Text Alignment with Smith-Waterman}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, eval = TRUE)
```

## Smith-Waterman

Smith-Waterman is an algorithm for identifying similarities between sequences. It finds an optimal local alignment between 2 sequences of letters and is explained in detail at https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm.

This package implements the algorithm for sequences of letters as well as sequences of words, which makes it useful for text analytics researchers.

- The package uses code similar to the `textreuse::local_align` function, and it can align character sequences in addition to word sequences
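
To make the idea of a local alignment more concrete, the chunk below is a minimal base R sketch of the Smith-Waterman scoring step. It is not the package's implementation: `sw_score_matrix()` is a hypothetical helper written for this illustration, and the scores used (match = 2, mismatch = -1, gap = -1) are assumed values. The same function works on a vector of letters or a vector of words, which is the distinction between character-level and word-level alignment.

```{r}
# Illustrative sketch only: sw_score_matrix() is a hypothetical helper,
# not a function of the text.alignment package.
# The match / mismatch / gap scores are assumed values for this example.
sw_score_matrix <- function(a, b, match = 2L, mismatch = -1L, gap = -1L) {
  # Extra first row and column of zeros, so an alignment can start anywhere
  H <- matrix(0L, nrow = length(a) + 1L, ncol = length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      s <- if (a[i] == b[j]) match else mismatch
      H[i + 1L, j + 1L] <- max(0L,                  # restart the alignment
                               H[i, j] + s,         # align a[i] with b[j]
                               H[i, j + 1L] + gap,  # a[i] aligned to a gap
                               H[i + 1L, j] + gap)  # b[j] aligned to a gap
    }
  }
  H
}

# Character-level: the tokens are single letters
chars_a <- strsplit("Tournelly", "")[[1]]
chars_b <- strsplit("Bourelly", "")[[1]]
H_chars <- sw_score_matrix(chars_a, chars_b)
max(H_chars)

# Word-level: the tokens are whole words
words_a <- strsplit("the quick brown fox", " ")[[1]]
words_b <- strsplit("a quick brown dog", " ")[[1]]
H_words <- sw_score_matrix(words_a, words_b)
max(H_words)
```

The highest value in the matrix is the score of the best local alignment. Because scores never drop below zero, unrelated text before or after the matching region does not penalise the match.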

## Example usage

The package was set up to make it easy to

- Find names in documents even if they are not correctly spelled
- Match 2 texts
- Find relevant sequences of texts in other texts

We show some examples of these use cases below.

```{r}
library(text.alignment)
```

### Example matching 2 names 

```{r}
# Two records with the same first name but a different surname and profession
a <- "Gaspard	Tournelly cardeur à laine"
b <- "Gaspard	Bourelly cordonnier"
smith_waterman(a, b)

# An abbreviated surname compared with the full record, aligned letter by letter
a <- "Gaspard	T.	cardeur à laine"
b <- "Gaspard	Tournelly cardeur à laine"
smith_waterman(a, b, type = "characters")
```

### Example matching 2 translations

```{r}
# Read the two example texts shipped with the package
# and collapse each of them into a single string
a <- system.file(package = "text.alignment", "extdata", "example1.txt")
a <- readLines(a)
a <- paste(a, collapse = "\n")
b <- system.file(package = "text.alignment", "extdata", "example2.txt")
b <- readLines(b)
b <- paste(b, collapse = "\n")
# Print both texts
cat(a, sep = "\n")
cat(b, sep = "\n")
```

```{r}
smith_waterman(a, b, type = "words")
```

### Find relevant sequences of texts in other texts

```{r}
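# Look for a short phrase in the longer text b
x <- smith_waterman("Lange rei", b)
# Tokens of b covered by the alignment
x$b$tokens[x$b$alignment$from:x$b$alignment$to]
# Positions of the match in b, as reported in the data.frame overview
overview <- as.data.frame(x)
overview$b_from
overview$b_to
# Extract the matching part of b using those positions
substr(overview$b, overview$b_from, overview$b_to)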
```
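
How the start and end positions of a match can be recovered is sketched below in base R. This is an illustration under assumptions, not the package's code: `sw_locate()` is a hypothetical helper, the scores are assumed values, and the text is lowercased before comparison. It fills the score matrix, starts at the highest-scoring cell and walks back until the score reaches zero.

```{r}
# Illustrative sketch only: sw_locate() is a hypothetical helper,
# not a function of the text.alignment package.
sw_locate <- function(a, b, match = 2L, mismatch = -1L, gap = -1L) {
  a <- strsplit(tolower(a), "")[[1]]
  b <- strsplit(tolower(b), "")[[1]]
  # Fill the score matrix (first row and column stay zero)
  H <- matrix(0L, nrow = length(a) + 1L, ncol = length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      s <- if (a[i] == b[j]) match else mismatch
      H[i + 1L, j + 1L] <- max(0L, H[i, j] + s,
                               H[i, j + 1L] + gap, H[i + 1L, j] + gap)
    }
  }
  # Start at the highest-scoring cell (the first one in case of ties)
  end <- which(H == max(H), arr.ind = TRUE)[1, ]
  i <- end[["row"]]
  j <- end[["col"]]
  # Walk back until the score drops to zero
  while (i > 1L && j > 1L && H[i, j] > 0L) {
    s <- if (a[i - 1L] == b[j - 1L]) match else mismatch
    if (H[i, j] == H[i - 1L, j - 1L] + s) {
      i <- i - 1L                      # diagonal step: a letter of a paired
      j <- j - 1L                      # with a letter of b
    } else if (H[i, j] == H[i - 1L, j] + gap) {
      i <- i - 1L                      # letter of a aligned to a gap
    } else {
      j <- j - 1L                      # letter of b aligned to a gap
    }
  }
  # Matrix column k corresponds to character k - 1 of b
  c(from = j, to = end[["col"]] - 1L)
}

text <- "De Lange Rei in Brugge"
pos <- sw_locate("lange rei", text)
substr(text, pos[["from"]], pos[["to"]])
```

The returned positions are character offsets into the original string, so they can be passed directly to `substr()`.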

### Get alignment overview as a data.frame

```{r}
x <- smith_waterman(a, b)
x <- as.data.frame(x, alignment_id = "matching-a-to-b")
str(x)
```