---
title: Securing data with duawranglr
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Securing data with duawranglr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{dplyr, readr}
---

```{r setup, include = FALSE}
options(width = 90)
knitr::opts_chunk$set(collapse = TRUE, comment = NA)
```

This vignette walks through the process for using duawranglr. It
assumes that the data administrator and researcher have executed a
data usage agreement (DUA) with three potential levels of data
restrictions and created a crosswalk spreadsheet in Excel.

## Administrative file to be wrangled

The raw administrative data file that needs to be processed looks like
this:  

|sid|sname|dob|gender|raceeth|tid|tname|zip|mathscr|readscr|
|:-:|:---:|:-:|:----:|:-----:|:-:|:---:|:-:|:-----:|:-----:|
|000-00-0001|Schaefer|19900114|0|2|1|Smith|22906|515|496|
|000-00-0002|Hodges|19900225|0|1|1|Smith|22906|488|489|
|000-00-0003|Kirby|19900305|0|4|1|Smith|22906|522|498|
|000-00-0004|Estrada|19900419|0|3|1|Smith|22906|516|524|
|000-00-0005|Nielsen|19900530|1|2|1|Smith|22906|483|509|
|000-00-0006|Dean|19900621|1|1|2|Brown|22906|503|523|
|000-00-0007|Hickman|19900712|1|1|2|Brown|22906|539|509|
|000-00-0008|Bryant|19900826|0|2|2|Brown|22906|499|490|
|000-00-0009|Lynch|19900902|1|3|2|Brown|22906|499|493|

And we have a codebook:  

* `sid`: Student social security number  
* `sname`: Student's last name  
* `dob`: Student's date of birth  
* `gender`: Indicator for student gender identification  
* `raceeth`: Factor variable indicatings student's racial/ethnic
  identification  
* `tid`: ID variable for student's teacher  
* `tname`: Last name of student's teacher  
* `zip`: Student's home address zip code  
* `mathscr`: Student's end-of-year test math score  
* `readscr`: Student's end-of-year test reading score  

The `admin_data.csv` file contains observations for 9 students and has 10
variables associated with each observation. Of these, 1 uniquely
identifies each student, 6 are associated with the student's personal
characteristics, 2 with each student's teacher, and 2 with the
student's test scores in reading and math. 

It appears that the school uses the student's social security number
to uniquely identify each student. As researchers interested in test
scores, we have no need for this highly protected data element other
than for its ability to uniquely identify a student or allow linking
to other records. Since we do not need to link to other records at the
moment, any unique number or string will work for our
purposes. Similarly, we don't really need the student's last name.

Besides math (`mathscr`) and reading (`readscr`) scores, we may be
interested in some of the other covariates. It's likely that many of
these data elements, however, also carry restrictions of varying
severity. For example, the school may be able to share the student's
race/ethnicity and gender (provided the student is not otherwise
identified) with most approved researchers, but can only share
teachers' names (`tid`) under more tightly restricted scenarios.

This is where our DUA crosswalk file comes in handy.

## Set DUA

The first step in the process is to set the DUA crosswalk file. The
crosswalk file can be in many different formats and, in most cases,
will be read in automatically no matter the type. (If using a
delimited file that isn't a comma- or tab-separated value format, give
the `delimiter` argument the delimiter string; if using an Excel file
with more than one sheet, give the `sheet` argument the sheet name or
number.) If successful, you will get message telling you so.

```{r, eval = FALSE}
library(tidyverse)
library(duawranglr)

## get crosswalk and admin data files
dua_cw_file <- system.file('extdata', 'dua_cw.csv', package = 'duawranglr')
admin_file <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')

## set the DUA crosswalk
set_dua_cw(dua_cw_file)
```
```{r, echo = FALSE}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(readr))
library(duawranglr)
dua_cw_file <- system.file('extdata', 'dua_cw.csv', package = 'duawranglr')
admin_file <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')
set_dua_cw(dua_cw_file)
```
## Check DUA options

In case you've forgotten the data elements that are restricted at a
particular level, you can check them using the `see_dua_options()`
function with the `level` argument set to the appropriate level. If you
want to compare restrictions across more than one level, you can give
the `level` argument a vector.

```{r}
## compare level II and III restrictions
see_dua_options(level = c('level_ii', 'level_iii'))
```
Alternately, you can see restrictions at all levels if you leave the
`level` argument at its default `NULL` value.

```{r}
## check all level restrictions
see_dua_options()
```

## Set DUA level

After consultation with our data partner, we've decided that data for
this project need to be set at Level II. Because no level allows us to
use the current unique ID, `sid`, we also need to deidentify the
data. We could just delete the `sid` column, but for reasons discussed
below, it will be better if we use it to make new, non-identifiable
but unique IDs. Therefore, we use additional arguments in
`set_dua_level()` to note that deidentification is required and set the
targeted ID column.

```{r}
## set DUA level
set_dua_level('level_ii', deidentify_required = TRUE, id_column = 'sid')
```

## Check DUA level

As we're preparing the data, we can check our restriction level and
the data element names it restricts using `see_dua_level()`. 

```{r}
## see set DUA level 
see_dua_level(show_restrictions = TRUE)
```

# Administrative data

After loading some libraries, we'll first read in the raw
administrative data file and confirm that it has nine observations and
the data elements we expect.

```{r, eval = FALSE}
## read in raw administrative data
df <- read_dua_file(admin_file)
df
```

```{r, echo = FALSE}
## read in raw administrative data
df <- readr::read_csv(admin_file,
                      col_types = cols(sid = col_character(),
                                       sname = col_character(),
                                       dob = col_character(),
                                       gender = col_integer(),
                                       raceeth = col_integer(),
                                       tid = col_integer(),
                                       tname = col_character(),
                                       zip = col_integer(),
                                       mathscr = col_integer(),
                                       readscr = col_integer()
                                       )
                      )
df
```	

# Deidentify data

## Single file or no existing crosswalk
```{r, echo = FALSE}
dff <- df
```

We indicated that the data need to be deidentified, so a good first
step in cleaning the raw data is to convert unique student id, `sid`,
into a similarly unique, but unidentifiable value. 

Why not just generate some random string for each value? Though we
don't care to merge these data with other files, we may need to do so
in the future. If we randomly generate new IDs, discarding the old
ones in the process, we will be stuck.

The `deid_dua()` function does two things:

1. It uses a secure `SHA-2` algorithm to convert sensitive IDs into
   unique hexadecimal strings that cannot be reverted back to the
   originial IDs (important in the case such as ours when the unique
   ID is the student's social security number);
2. It has the option to save a crosswalk file that links the old
   secure IDs to the new IDs.
   
Clearly, it defeats the purpose of deidentifying IDs if a crosswalk
between old and new travels with the new data. But if the crosswalk
file is keep in a secure location, perhaps on the same server that
hosts the raw administrative data, then old IDs can be retrieved if
necessary by those with the proper clearance to do so.

```{r, echo = FALSE}
## deidentify data
tmpdir <- tempdir()
df <- deid_dua(df, write_crosswalk = TRUE, id_length = 20,
               crosswalk_filename = file.path(tmpdir, 'tmp.csv'))
```
```{r, eval = FALSE}
## deidentify data
df <- deid_dua(df, write_crosswalk = TRUE, id_length = 20)
```
Here's what the saved crosswalk looks like:

```{r, echo = FALSE}
## show crosswalk
cw <- readr::read_csv(file.path(tmpdir, 'tmp.csv'),
                      col_types = cols(.default = 'c'))
cw
rm(tmpdir)
```
And here now is the data frame:

```{r}
## show data frame
df
```

## Links across multiple files with existing crosswalk

```{r, echo = FALSE}
df <- dff
```

If the deidentified data frame is built from multiple files (*e.g.*, a
panel data set of observations across years), then we'll want to reuse
an existing crosswalk. Otherwise, the same original ID will end up
with multiple new IDs and we won't be able to link observations across
data sets.

Let's say we already have master crosswalk file that looks like this:

```{r, echo = FALSE}
tmpdir <- tempdir()
cw2 <- readr::read_csv('../tests/testthat/testdata/crosswalk_full.csv',
                       col_types = cols(.default = 'c'))
readr::write_csv(cw2, file.path(tmpdir, 'crosswalk_full.csv'))
cw2
```

Rather than create new IDs, we can use the `existing_crosswalk`
argument to read in and use the new IDs we've already made. Everything
else works the same as before.    

```r
df <- deid_dua(df, existing_crosswalk = 'master_crosswalk.csv')
```

```{r, echo = FALSE}
df <- deid_dua(df, existing_crosswalk = file.path(tmpdir, 'crosswalk_full.csv'))
rm(tmpdir)
```

The new ID values now match those from the crosswalk.
```{r}
df
```
```{r, echo = FALSE}
df <- dff
```
## Updates to existing crosswalk

In our example, we have nine students in the current file. Let's say
that though we have a crosswalk, it only has new IDs for the first
five observations:

```{r, echo = FALSE}
tmpdir <- tempdir()
cw3 <- readr::read_csv('../tests/testthat/testdata/crosswalk_partial.csv',
                       col_types = cols(.default = 'c'))
readr::write_csv(cw3, file.path(tmpdir, 'crosswalk_partial.csv'))
cw3
```

If the existing crosswalk doesn't have values for all observations,
then `deid_dua()` will:  

1. Match old IDs with new IDs that **do** exist in the crosswalk
2. Generate new IDs for the old IDs that **don't** exist in the
   crosswalk
3. Update and save the crosswalk

The command is the same for a partial crosswalk as for a complete crosswalk.

```r
df <- deid_dua(df, existing_crosswalk = 'crosswalk_partial.csv')
```

```{r, echo = FALSE}
df <- deid_dua(df, existing_crosswalk = file.path(tmpdir, 'crosswalk_partial.csv'))

```
Notice that the new IDs for the first five observations match those
that were already in the existing crosswalk. The last four are new.
```{r}
df
```
Looking at the partial crosswalk, we see that it now has four new rows
with new IDs each for the observations it didn't have before. 
```{r, echo = FALSE}
cw4 <- readr::read_csv(file.path(tmpdir, 'crosswalk_partial.csv'),
                       col_types = cols(.default = 'c'))
rm(tmpdir)
cw4
```

Should we encounter those students in future files, `deid_dua()` will
use the new IDs we just created.


# Check data frame

If we try to write the data frame using the `write_dua_df()` function,
we get an error.

```{r}
## write data to disk with one last check
write_dua_df(df, 'cleaned_data.csv', output_type = 'csv')
```

Right, we haven't removed all the restricted data elements. Following
the directions, we can check to see what still needs to be removed
using the `check_dua_restrictions()` function.

```{r}
## check
check_dua_restrictions(df)
```

We've successfully removed `sid` already (when we deidentified the
data frame), but still have to remove the student's last name, date of
birth, teacher's name, and zip code to meet level II
restrictions. Once we remove those columns, we can check again.

```{r}
## remove restricted columns
df <- df %>% select(-c(sname, dob, tname, zip))

## check again
check_dua_restrictions(df)
```
Success! And to be sure, here's what our data frame looks like now:

```{r}
df
```

# Write cleaned data frame to disk

Now that we've passed our check, we can write the level II secure data
frame to disk. Just like the `set_dua_cw()` function, which automates
reading in many types of files, `write_dua_df()` will write many types
of files. See `?write_dua_df` for options.

```{r, eval = FALSE}
## write data to disk 
write_dua_df(df, 'cleaned_data_lev_ii.csv', output_type = 'csv')
```

# Interactive template

Particularly for the first few times you use this package, you may
need help remembering the steps. To help the process, the interactive
`make_dua_template()` function will help you make a template script
that you can then modify to meet your data cleaning needs. When
called, the function will ask you a few yes or no questions and, based
on your answers, build a template script that pre-fills some function
arguments. 

An example template script is printed below.

```{r, eval = FALSE}
## save template to disk
make_dua_template('clean_data.R')
```
#### EXAMPLE

```{r, echo = FALSE}
file <- file.path(tempdir(), 'clean_data.R')
make_dua_template(file, answer_list = list('N','','N','',''))
writeLines(readLines(file))
```

