---
title: "Using clean_strings"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using clean_strings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(fedmatch)
```

# Using clean_strings

`clean_strings` is the way to prepare strings for name matching, either within `tier_match` (see the `Using-tier-match` vignette). There are several useful options that allow for many different options.

Here's the example string we'll be using:

```{r}
name_vec <- corp_data1[, Company]
name_vec
```

First, we can use the basic string cleaning defaults:

```{r}
clean_strings(name_vec)
```

Without any additional arguments, `clean_strings` does the following:

- Make everything lowercase
- Replace the special characters &, @, %, $ with their word equivalents
- Remove all other special characters (e.g. commas, periods)
- Convert tabs to spaces
- Remove extra spaces

Then, we have a few different options we can use. 

## sp_char_words
`sp_char_words` is a data.frame with 2 columns: the first column is symbols to replace, and the second is their replacement. `fedmatch` as a built-in set of symbols:

```{r}
print(sp_char_words)
```

But, you can use any data.frame you'd like, to make whatever replacements you'd like:

```{r}
new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
```

## common_words
`common_words` is similar, but it respects word boundaries (so you don't replace every usage of 'Corp' with 'Corporation', for example.) `fedmatch` has a built-in set of 54 words and their replacements:
```{r}
print(corporate_words[1:5])
```

But, you can use whatever words you'd like:

```{r}
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
                                                              replacement = c("bananas", "oranges")))

```
(bananas motors sounds like a lovely place to work). Note that the 'almart' in 'walmart' didn't get replaced, because common_words respects word boundaries.,

You can also use a related function, `word_frequency`, to look for the most common strings in your data:
```{r}
word_frequency(sample(c("hi", "Hello", "bye    "), 1e4, replace = TRUE))
```

## Remove characters and words
remove_words and remove_char are booleans that let you simply remove the words in 'common_words' or specify a set of characters to remove rather than replacing them.

```{r}
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
                                                              replacement = c("bananas", "oranges")),
              remove_words = TRUE)
```


## stem

`stem` is a boolean that lets you stem words, using `SnowballC::wordStem`. 'stemming' words means removing common suffixes:

```{r}
clean_strings(c( "call", "calling", "called"), stem = TRUE)
```
See the documentation in `SnowballC::wordStem` for details.

