---
title: "Sample Splitting with Caret/SuperLearner"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sample Splitting with Caret/SuperLearner}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "../man/figures/README-"
  )

library(dplyr)
load("../data/star.rda")

# specifying the outcome
outcomes <- "g3tlangss"

# specifying the treatment
treatment <- "treatment"

# specifying the data (remove other outcomes)
star_data <- star %>% dplyr::select(-c(g3treadss,g3tmathss))

# specifying the formula
user_formula <- as.formula(
  "g3tlangss ~ treatment + gender + race + birthmonth + 
  birthyear + SCHLURBN + GRDRANGE + GKENRMNT + GKFRLNCH + 
  GKBUSED + GKWHITE ")
```

### Train the model with Caret

We can train the model with the `caret` package (for further information about `caret`, see [the original website](http://topepo.github.io/caret/index.html)). We use parallel computing to speed up the computation. 

```{r parallel, message = FALSE, eval = FALSE}
# parallel computing
library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# stop after finishing the computation
stopCluster(cl)
```

The following example shows how to estimate the ITR with gradient boosting machine (GBM) using the `caret` package. Note that we have already loaded the data and specify the treatment, outcome, and covariates as shown in the [Sample Splitting](sample_split.html) vignette. Since we are using the `caret` package, we need to specify the `trainControl` and/or `tuneGrid` arguments. The `trainControl` argument specifies the cross-validation method and the `tuneGrid` argument specifies the tuning grid. For more information about these arguments, please refer to the [caret website](http://topepo.github.io/caret/model-training-and-tuning.html).

We estimate the ITR with only one machine learning algorithm (GBM) and evaluate the ITR with the `evaluate_itr()` function. To compute `PAPDp`, we need to specify the `algorithms` argument with more than 2 machine learning algorithms.

```{r caret estimate, message = FALSE}
library(evalITR)
library(caret)

# specify the trainControl method
fitControl <- caret::trainControl(
  method = "repeatedcv", # 3-fold CV
  number = 3, # repeated 3 times
  repeats = 3,
  search='grid',
  allowParallel = TRUE) # grid search

# specify the tuning grid
gbmGrid <- expand.grid(
  interaction.depth = c(5,9), 
  n.trees = (5:10)*100, 
  shrinkage = 0.1,
  n.minobsinnode = 20)

# estimate ITR
fit_caret <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  trControl = fitControl,
  data = star_data,
  algorithms = c("gbm"),
  budget = 0.2,
  split_ratio = 0.7,
  tuneGrid = gbmGrid,
  verbose = FALSE)

# evaluate ITR
est_caret <- evaluate_itr(fit_caret)

```


We can extract the training model from `caret` and check the model performance. Other functions from `caret` can be applied to the training model.

```{r caret_model, message = FALSE, warning = FALSE, fig.width = 6, fig.height = 4}
# extract the final model
caret_model <- fit_caret$estimates$models$gbm
print(caret_model$finalModel)

# check model performance
trellis.par.set(caretTheme()) # theme
plot(caret_model) 
# heatmap 
plot(
  caret_model, 
  plotType = "level",
  scales = list(x = list(rot = 90)))
```

### Train the model with SuperLearner

Alternatively, we can train the model with the `SuperLearner` package (for further information about `SuperLearner`, see [the original website](https://CRAN.R-project.org/package=SuperLearner/vignettes/Guide-to-SuperLearner.html)). SuperLearner utilizes ensemble method by taking optimal weighted average of multiple machine learning algorithms to improve model performance. 

We will compare the performance of the ITR estimated with `causal_forest` and `SuperLearner`.   
  
```{r sl_summary, message = FALSE, warning = FALSE}
library(SuperLearner)

fit_sl <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  data = star_data,
  algorithms = c("causal_forest","SuperLearner"),
  budget = 0.2,
  split_ratio = 0.7,
  SL_library = c("SL.ranger", "SL.glmnet"))

est_sl <- evaluate_itr(fit_sl)

# summarize estimates
summary(est_sl)
```

We plot the estimated Area Under the Prescriptive Effect Curve for the writing score across a range of budget constraints, seperately for the two ITRs, estimated with `causal_forest` and `SuperLearner`. 

```{r sl_plot, fig.width=6, fig.height=4,fig.align = "center"}
# plot the AUPEC 
plot(est_sl)
```