---
title: "Cross-validation with single algorithm"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cross-validation with single algorithm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "../man/figures/README-"
  )

library(dplyr)

load("../data/star.rda")

# specifying the outcome
outcomes <- "g3tlangss"

# specifying the treatment
treatment <- "treatment"

# specifying the data (remove other outcomes)
star_data <- star %>% dplyr::select(-c(g3treadss,g3tmathss))

# specifying the formula
user_formula <- as.formula(
  "g3tlangss ~ treatment + gender + race + birthmonth + 
  birthyear + SCHLURBN + GRDRANGE + GKENRMNT + GKFRLNCH + 
  GKBUSED + GKWHITE ")
```


When users choose to estimate and evaluate ITR under cross-validation, the package implements Algorithm 1 from [Imai and Li
(2023)](https://arxiv.org/abs/1905.05389) to estimate and evaluate ITR. For more information about Algorithm 1, please refer to the [this page](../articles/paper_alg1.html).


Instead of specifying the `split_ratio` argument, 
we choose the number of folds (`n_folds`). We present an example of estimating ITR with 3 folds cross-validation. 
In practice, we recommend using 10 folds to get a more stable model performance. 


| Input                                                            | R package input                                               | Descriptions                                                                                                                                                         |
|:-----------------------------------------------------------------|:---------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data $\mathbf{Z}=\left\{\mathbf{X}_i, T_i, Y_i\right\}_{i=1}^n$ | `treatment = treatment, form = user_formula, data = star_data` | `treatment` is a character string specifying the treatment variable in the `data`; `form` is a formula specifying the outcome and covariates; and a dataframe `data` |
| Machine learning algorithm $F$                                   | `algorithms = c("causal_forest")`                              | a character vector specifying the ML algorithms to be used                                                                                                           |
| Evaluation metric $\tau_f$                                       | PAPE, PAPD, AUPEC, GATE                                        | By default                                                                                                                                                           |
| Number of folds $K$                                              | `n_folds = 3`                                                  | `n_folds` is a numeric value indicating the number of folds used for cross-validation                                                                                |
| …                                                                | `budget = 0.2`                                                 | `budget` is a numeric value specifying the maximum percentage of population that can be treated under the budget constraint                                          |


```{r cv_estimate, message = FALSE, out.width = '60%'}
library(evalITR)
# estimate ITR 
set.seed(2021)
fit_cv <- estimate_itr(
               treatment = treatment,
               form = user_formula,
               data = star_data,
               algorithms = c("causal_forest"),
               budget = 0.2,
               n_folds = 3)

```

The output will be an object that
includes estimated evaluation metric $\hat{\tau}_F$ and the estimated
variance of $\hat{\tau}_F$ for different metrics (PAPE, PAPD, AUPEC).


```{r cv_eval, message = FALSE, out.width = '50%'}
# evaluate ITR 
est_cv <- evaluate_itr(fit_cv)

# summarize estimates
summary(est_cv)
```