---
title: "Introduction to datafsm"
author: "John J. Nay and Jonathan M. Gilligan"
date: "Sept. 5, 2015"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Introduction to datafsm}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Abstract

The **datafsm** package implements a method for automatically generating models 
of dynamic processes that both have strong predictive power and are 
interpretable in human terms. We use an efficient model representation and a 
genetic algorithm-based estimation process. This paper offers a brief overview 
of the software and demonstrates how to use it.


# Introduction

This package implements our method for automatically generating models of 
dynamic decision-making that both have strong predictive power and are 
interpretable in human terms. We use an efficient model representation and a 
genetic algorithm-based estimation process to generate simple deterministic 
approximations that explain most of the structure of complex stochastic 
processes. The genetic algorithm is implemented with the **GA** package 
([Scrucca 2013](https://www.jstatsoft.org/v53/i04/)). Our method, implemented in 
C++ and R, scales well to large data sets. We have applied the package to 
empirical data, and demonstrated the method's ability to recover known 
data-generating processes by simulating data with agent-based models and 
correctly deriving the underlying decision models for multiple agent models and 
degrees of stochasticity.

A user of our package can estimate models by providing data in a common 
"panel data" format. The package is designed to estimate time series 
classification models that use a small number of binary predictor variables, 
where the outcome variable moves back and forth between its values over time. 
Larger sets of predictor variables can be reduced to smaller sets with 
cross-validation. Although the predictor variables must be binary, a 
quantitative variable can be converted to binary by dividing its observed 
values into high and low classes. Future releases of the package may include 
additional estimation methods to complement genetic algorithm optimization.
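As a minimal sketch of the binarization step mentioned above (the variable name and the median threshold are illustrative choices, not part of the package):

```r
# Hypothetical sketch: convert a quantitative variable into a binary
# high/low predictor by splitting at the median. The threshold is the
# analyst's choice; the median is just one convenient option.
price <- c(2.1, 5.7, 3.3, 8.4, 1.0, 6.2)
price_high <- price > median(price)  # logical vector, usable as a predictor
price_high
#> [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```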

```{r load.data.fsm}
# Load and attach datafsm into your R session, making its functions available:
library(datafsm)
```

If you use the package, please cite it with the text generated by:

```{r cite.datafsm, eval=TRUE, include=TRUE}
citation("datafsm")
```

# Fake Data Example

```{r simulation_parameters, include=FALSE, eval=TRUE}
rounds <- 10
reps <- 1000
noise_level <- 0.05
```
To quickly show that the package works, we can create fake data. Here, we 
generate `r reps` repetitions of a `r rounds`-round game in which each player 
starts with a noisy cooperative move, and in subsequent rounds each player 
follows a "noisy tit-for-tat" strategy: equivalent to tit-for-tat, except that 
with a `r scales::percent(noise_level, accuracy = 1)` probability the player 
makes a random move.

```{r simulation_parameters, echo=TRUE, eval=FALSE}
```
```{r fake.data, eval=TRUE, include=TRUE}
seed1 <- 1372456261
seed2 <-  597693057
seed3 <-  805078823
set.seed(seed1)
cdata <- data.frame(outcome = NA,
                    period = rep(seq(rounds), reps),
                    my.decision1 = NA,
                    other.decision1 = NA)

#
# Prisoner's dilemma
#
pd_outcome <- function(player_1, player_2) {
  #
  # Translate player 1's move (0 = cooperate, 1 = defect) into the
  # outcome code: 1 = C, 2 = D. (player_2 is not used.)
  #
  player_1 + 1
}

tit_for_tat <- function(last_round_self, last_round_opponent) {
    last_round_opponent
}

noisy_tit_for_tat <- function(last_round_self, last_round_opponent, noise_level) {
  if (runif(1,0,1) <= noise_level) {
    sample(0:1,1)
  } else {
    last_round_opponent
  }
}

for (i in seq_along(cdata$period)) {
  if (cdata$period[i] == 1) {
    my.decision <- sample(0:1,1, prob = c(1 - noise_level, noise_level))
    other.decision <- sample(0:1,1, prob = c(1 - noise_level, noise_level))
    cdata[i, "outcome"] <- pd_outcome(my.decision, other.decision)
  } else{
    my.last <- my.decision
    other.last <- other.decision
    my.decision <- noisy_tit_for_tat(my.last, other.last, noise_level)
    other.decision <- noisy_tit_for_tat(other.last, my.last, noise_level)
    cdata[i,c("outcome", "my.decision1", "other.decision1")] <- 
      c(pd_outcome(my.decision, other.decision), my.last, other.last)
  }
}
```

The only required argument of the main function of the package, `evolve_model`, 
is a `data.frame` object, which must have 3-5 columns. The first two columns 
must be named `period` and `outcome` (`period` is the time period that the 
`outcome` action was taken). The remaining one to three columns are predictors, 
and may have arbitrary names. Each row of the `data.frame` is an observational 
unit: an action taken at a particular time, together with the relevant 
predictor values for that time. All columns should be named. The `period` and 
`outcome` columns should be integer vectors --- e.g. `c(1,2,1)` --- and the 
predictor columns should be logical vectors --- e.g. 
`c(TRUE, FALSE, FALSE)` --- or vectors that can be coerced to logical with 
`as.logical()`.
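The required layout can be sketched with a toy `data.frame` (the column values here are arbitrary, made up purely to illustrate the format):

```r
# A minimal data.frame in the shape evolve_model() expects:
# `period` and `outcome` as integers, then 1-3 logical predictors.
toy <- data.frame(
  period          = c(1L, 2L, 3L),
  outcome         = c(1L, 2L, 1L),
  my.decision1    = c(FALSE, TRUE, FALSE),
  other.decision1 = c(FALSE, FALSE, TRUE)
)
str(toy)
```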

Here are the first eleven rows of this fake data:

```{r cdata.table, eval=TRUE, echo=FALSE, results='asis'}
knitr::kable(head(cdata, 11))
```

We can estimate a model with that data. `evolve_model` uses a genetic algorithm 
to estimate a finite-state machine (FSM) model, primarily for understanding and 
predicting decision-making. This is the main function of the `datafsm` package. 
It relies on the `GA` package for genetic algorithm optimization. We chose a 
GA because GAs perform well on integer optimization problems in rugged search 
spaces, are a natural complement to our binary representation of FSMs, and are 
easily parallelized.

`evolve_model` takes data on the predictors and the outcome and automatically 
creates a fitness function. That fitness function takes as input the training 
data, an action vector that `evolve_model` generates, and a state matrix that 
`evolve_model` generates, and returns a numeric vector whose length is the 
number of rows in the training data. `evolve_model` then computes a fitness 
score for that candidate FSM by comparing its predictions to the `outcome` in 
the real training data. This is repeated for every FSM in the population, and 
the probability of selection into the next generation is proportional to the 
fitness scores. If the argument `cv` is set to `TRUE`, the function calls 
itself recursively, varying the number of states inside a cross-validation 
loop to estimate the optimal number of states; it then sets the number of 
states to that optimum and estimates the model on the full training set.

If the argument `parallel` is set to `TRUE`, these evaluations are distributed 
across the available processors of the computer using `doParallel` functions; 
otherwise, the fitness evaluations are conducted sequentially. Because the 
fitness function that `evolve_model` creates must loop through all of the 
training data every time it is evaluated, and many candidate FSMs must be 
evaluated, the fitness function is implemented in C++ to improve its 
performance.
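The two options described above can be combined in a single call. As a sketch (not evaluated here, since cross-validation multiplies the already substantial GA run time):

```r
# Sketch: enable the options discussed above.
# `cv = TRUE` selects the number of states by cross-validation;
# `parallel = TRUE` distributes fitness evaluations via doParallel.
res_cv <- evolve_model(cdata, cv = TRUE, parallel = TRUE)
```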

```{r evolve.model, eval=TRUE, include=TRUE, message=FALSE, warning=FALSE, results='hide'}
set.seed(seed2)
res <- evolve_model(cdata, seed = seed3)
```

`evolve_model` evolves the models on the training data and then, if a test set 
is provided, uses the best solution to make predictions on the test data. 
Finally, the function returns the GA object and the decoded version of the 
best string in the population. Formally, the function returns an `S4` object 
with slots for:

* `call` Language from the call of the function.
* `actions` Numeric vector with the number of actions.
* `states` Numeric vector with the number of states.
* `GA` S4 object created by `ga()` from the `GA` package.
* `state_mat` Numeric matrix with rows as states and columns as predictors.
* `action_vec` Numeric vector indicating what action to take for each state.
* `predictive` Numeric vector of length one with the test data accuracy if 
  test data was supplied; otherwise, a character vector with a message that 
  the user should provide test data for a better estimate of generalizable 
  performance.
* `varImp` Numeric vector with length equal to the number of columns in the 
  state matrix, containing relative importance scores for each predictor.
* `timing` Numeric vector of length one containing the total elapsed time it 
  took `evolve_model` to execute.
* `diagnostics` Character vector of length one, designed to be printed with 
  `base::cat()`.
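Since the return value is an `S4` object, the slots listed above can be read directly with the `@` operator (a sketch, assuming `res` is the `ga_fsm` object estimated earlier; the package's accessor functions such as `action_vec()` are the more idiomatic route):

```r
# Inspect the estimated FSM via its S4 slots.
res@state_mat   # matrix: rows are states, columns are predictors
res@action_vec  # action to take in each state
res@varImp      # relative importance score for each predictor
```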

Use the summary and plot methods on the output of the `evolve_model()` 
function:

```{r plot.fsm, eval=TRUE, include=TRUE, fig.width=5, fig.height=4, fig.cap="Result of `plot()` method call on `ga_fsm` object."}
summary(res)
plot(res, action_label = ifelse(action_vec(res)==1, "C", "D"), 
     transition_label = c('cc','dc','cd','dd'))
```

The diagram shows that `evolve_model` recovered a tit-for-tat model in which 
the player in question ("me") mimics the last action of the opponent.

Use the `estimation_details` method on the output of the `evolve_model()` 
function:

```{r plot.evolution, eval=TRUE, include=TRUE, fig.width=8, fig.height=6, fig.cap="Result of `plot()` method call on ga object, which is obtained by calling `estimation_details()` on `ga_fsm` object."}
suppressMessages(library(GA))
plot(estimation_details(res))
```


# Documentation for `evolve_model`

Check out the documentation for the main function of the package to learn about 
all the options:
```r
?evolve_model
```


# Acknowledgments

This work is supported by U.S. National Science Foundation grants EAR-1416964 
and EAR-1204685.


# Session Info

```{r session.info, eval=TRUE, include=TRUE}
sessionInfo()
```

