---
title: "Exploratory Analysis for Micro-Randomized Trial (MRT): Continuous Distal Outcomes"
author: "Tianchen Qian (t.qian@uci.edu)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
link-citations: yes
bibliography: mhealth-ref.bib
csl: biostatistics.csl
vignette: >
  %\VignetteIndexEntry{Exploratory Analysis for Micro-Randomized Trial (MRT): Continuous Distal Outcomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

The `MRTAnalysis` package now supports estimating the distal causal excursion effect (DCEE) on a continuous **distal outcome** in micro-randomized trials (MRTs), using the function `dcee()`.  
A distal outcome is measured once at the end of the study (e.g., weight loss, cognitive score), in contrast to **proximal outcomes**, which are measured repeatedly after each treatment decision point.

This vignette introduces:

- The data structure and the distal causal excursion effects (DCEE).  
- How to use the `dcee()` function to estimate the DCEE for an MRT with a continuous distal outcome.
- Example analyses with synthetic data.  
- Moderated effects, cross-fitting, and machine learning options.  
- Interpretation of the results.

# Data Structure of MRT with Distal Outcomes

In a distal-outcome MRT:

- **Treatment assignment**: At each decision point $t$, the treatment $A_{it}$ is randomized with probability $p_{it}$.
- **Covariates**: $X_{it}$ denotes the time-varying covariates, including potential moderators.
- **Availability**: $I_{it} = 1$ if participant $i$ is available at decision point $t$, and $0$ otherwise.
- **Outcome**: $Y_i$ is the distal outcome, measured once at the end of the study.

Thus, each row in the long-format data corresponds to one participant at one decision point and records $(X_{it}, A_{it}, I_{it}, p_{it})$, with $Y_i$ repeated on every row of participant $i$.
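
To make the layout concrete, here is a minimal sketch of a long-format data set with two participants and three decision points each. The object name `toy_mrt`, the column names, and all values are made up for illustration and are not part of the package. The sketch also assumes the common convention that no treatment is delivered ($A_{it} = 0$) when a participant is unavailable.

```{r}
# Hypothetical toy data: 2 participants, 3 decision points each.
# Column names mirror the notation above and are illustrative only.
set.seed(1)
toy_mrt <- data.frame(
    id = rep(1:2, each = 3),    # participant i
    t  = rep(1:3, times = 2),   # decision point t
    X  = round(rnorm(6), 2),    # time-varying covariate X_it
    I  = c(1, 1, 0, 1, 1, 1),   # availability I_it
    p  = rep(0.5, 6)            # randomization probability p_it
)
# Assumed convention: randomize with probability p when available, A = 0 otherwise
toy_mrt$A <- ifelse(toy_mrt$I == 1, rbinom(6, 1, toy_mrt$p), 0)
# Distal outcome Y_i: one value per participant, repeated on each of that participant's rows
toy_mrt$Y <- rep(c(2.3, -0.7), each = 3)

# Check that Y is constant within participant
y_is_constant <- tapply(toy_mrt$Y, toy_mrt$id, function(y) length(unique(y)) == 1)
stopifnot(all(y_is_constant))
toy_mrt
```

A check like `y_is_constant` can be a useful sanity test on real data before fitting, since a distal outcome that varies within a participant usually signals a merge error.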

## Distal Causal Excursion Effects

The distal causal excursion effects are defined using potential outcomes in @qian2025distal.
Roughly speaking, the DCEE at decision point $t$ is the difference in the outcome $Y_i$ due to assigning treatment $A_{it}=1$ versus $A_{it}=0$ at time $t$, while keeping the past and future treatment assignments according to the randomization probabilities in the MRT (i.e., the MRT policy), and averaging over the covariate history and availability at $t$.
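
As a rough formalization of the description above, write $Y_i(\bar{A}_{i,t-1}, a, \underline{A}_{i,t+1})$ for the potential distal outcome when the treatment at decision point $t$ is set to $a$, while the past treatments $\bar{A}_{i,t-1}$ and the future treatments $\underline{A}_{i,t+1}$ are drawn according to the MRT policy. This notation is introduced here for illustration only and may differ from the formal definition in @qian2025distal, which also specifies how availability enters. Under these assumptions, the fully marginal effect at $t$ can be sketched as

$$
\text{DCEE}(t) = E\left\{ Y_i(\bar{A}_{i,t-1}, 1, \underline{A}_{i,t+1}) - Y_i(\bar{A}_{i,t-1}, 0, \underline{A}_{i,t+1}) \right\}.
$$

A moderated effect would replace the expectation with a conditional expectation given the chosen moderators at decision point $t$.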

# Example Dataset

This package provides `data_distal_continuous`, a synthetic dataset with:

- `userid`: participant id.  
- `dp`: decision point index.  
- `X`: continuous endogenous covariate.  
- `Z`: binary endogenous covariate.  
- `avail`: availability indicator.  
- `A`: treatment indicator.  
- `prob_A`: randomization probability.  
- `A_lag1`: lag-1 treatment.
- `Y`: continuous distal outcome, identical across all rows for the same `userid`.  

```{r}
library(MRTAnalysis)
current_options <- options(digits = 3) # save current options for restoring later
head(data_distal_continuous, 10)
```

# Using `dcee()`

## Fully Marginal Effect (no moderators)

The call to `dcee()` below specifies:

- `id = "userid"`: the participant identifier.
- `outcome = "Y"`: the distal outcome variable.
- `treatment = "A"`: the treatment variable.
- `rand_prob = "prob_A"`: the time-varying randomization probability.
- `moderator_formula = ~1`: the fully marginal effect is the quantity to be estimated.
- `control_formula = ~ X + Z`: `X` and `Z` are used as control variables.
- `availability = "avail"`: the availability variable.
- `control_reg_method = "lm"`: linear regression is used for the control regression model (i.e., the Stage-1 nuisance models in the two-stage estimation procedure in @qian2025distal).

Note that the estimator for the distal causal excursion effect is consistent even if the control regression model is misspecified, as long as the treatment randomization probabilities are correctly specified. This holds in an MRT because the randomization probabilities are known by design. Different control regression methods can be used to improve efficiency.


```{r}
fit_lm <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "lm"
)
summary(fit_lm)
```

The `summary()` function reports the estimated distal causal excursion effect together with its 95% confidence interval, standard error, and p-value. The table `Distal causal excursion effect (beta)` has a single row, named `Intercept`, which is the fully marginal effect (analogous to an intercept in the causal effect model). Here the estimated marginal distal causal excursion effect is 0.404, with 95% confidence interval (-0.771, 1.579) and p-value 0.49. The confidence interval and the p-value are based on t-quantiles.
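
To illustrate what "based on t-quantiles" means, the sketch below computes a two-sided interval and p-value from an estimate, a standard error, and a degrees-of-freedom value. The helper `t_inference()` is hypothetical and not part of `MRTAnalysis`, and the inputs are made up. In particular, the degrees of freedom used internally by `summary()` are not stated in this vignette, so `df` is simply an input here.

```{r}
# Hypothetical helper (not part of MRTAnalysis): t-based interval and p-value
t_inference <- function(est, se, df, level = 0.95) {
    crit <- qt(1 - (1 - level) / 2, df = df)   # two-sided t critical value
    t_stat <- est / se                         # test statistic for H0: effect = 0
    c(
        lower = est - crit * se,
        upper = est + crit * se,
        p_value = 2 * pt(-abs(t_stat), df = df)
    )
}

# Made-up inputs, for illustration only
res_t <- t_inference(est = 0.5, se = 0.25, df = 40)
res_t
```

Compared with normal quantiles, t-quantiles give slightly wider intervals, and the difference is more noticeable when the degrees of freedom are small.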

## Moderated Effect

The following code uses `dcee()` to estimate the distal causal excursion effect moderated by the time-varying covariate `Z`. This is achieved by setting `moderator_formula = ~ Z`.

```{r}
fit_mod <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~Z,
    control_formula = ~ Z + X,
    availability = "avail",
    control_reg_method = "lm"
)
summary(fit_mod, lincomb = c(1, 1)) # beta0 + beta1
```
In the code above, the optional `lincomb` argument asks `summary()` to also compute and print the estimate of $\beta_0 + \beta_1$, which is the distal causal excursion effect when the binary variable $Z$ takes value 1. Setting `lincomb = c(1, 1)` requests $[1, 1] \times (\beta_0, \beta_1)^T = \beta_0 + \beta_1$. The table under `Linear combinations (L * beta)` shows the fitted result for this linear combination of coefficients.
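
The arithmetic behind a linear combination can be sketched in base R. The coefficient values and covariance matrix below are made up, and the standard-error formula shown is the usual one for a linear combination of estimates. This is an illustration of the idea and is not claimed to be the exact computation performed inside `summary()`.

```{r}
# Made-up coefficient estimates and covariance matrix, for illustration only
beta_hat <- c(0.2, 0.3)                      # (beta0, beta1)
V_hat <- matrix(c(0.04, -0.01,
                  -0.01, 0.09), nrow = 2)    # hypothetical covariance of beta_hat
L <- matrix(c(1, 1), nrow = 1)               # same weights as lincomb = c(1, 1)

lc_est <- drop(L %*% beta_hat)               # beta0 + beta1
lc_se <- sqrt(drop(L %*% V_hat %*% t(L)))    # sqrt(Var(b0) + Var(b1) + 2 * Cov(b0, b1))
c(estimate = lc_est, std_error = lc_se)
```

Because the covariance between the two coefficients enters the standard error, the uncertainty of $\beta_0 + \beta_1$ generally cannot be read off from the two individual standard errors alone.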

## GAM Nuisance Models

One can use generalized additive models (GAM) for the control regression models by setting `control_reg_method = "gam"`. This may improve efficiency if the relationship between the distal outcome and the covariates is non-linear. One can use `s()` to specify non-linear terms in the `control_formula`. For example, here we use a smooth term for the continuous covariate `X`, by setting `control_formula = ~ s(X) + Z`.


```{r}
fit_gam <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~Z,
    control_formula = ~ s(X) + Z,
    availability = "avail",
    control_reg_method = "gam"
)
summary(fit_gam)
```

## Random Forest / Ranger Nuisance Models

One can also use tree-based methods for the control regression models by setting `control_reg_method = "rf"` (random forest via the `randomForest` package) or `control_reg_method = "ranger"` (a faster random forest via the `ranger` package). This may improve efficiency if the relationship between the distal outcome and the covariates is complex. Note that tree-based methods do not allow smooth terms like `s(X)`, so the `control_formula` has to be specified using main terms only. Additional optional arguments can be passed to the underlying random forest function via the `...` argument of `dcee()`. This is not shown in the example below.


```{r}
fit_rf <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "rf" # can replace "rf" with "ranger" for faster implementation
)
summary(fit_rf)
```

# Cross-Fitting

The `dcee()` function also supports cross-fitting, which may improve finite-sample performance when complex machine learning methods are used for the control regression models. Cross-fitting is enabled by setting `cross_fit = TRUE`, and the number of folds is specified via `cf_fold`. As an example, here we use 5-fold cross-fitting with generalized additive models for the control regression models. The particular cross-fitting algorithm follows Section 4 in the Web Appendix of @zhong2021aipw.

```{r}
fit_cf <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "gam",
    cross_fit = TRUE, cf_fold = 5
)
summary(fit_cf)
```
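
For readers unfamiliar with cross-fitting, the following base-R sketch shows the general idea on simulated data: each observation's nuisance prediction comes from a model fitted without that observation's fold. This is an illustration only and not the internal code of `dcee()`. It assumes that folds are formed at the participant level, which is a common choice for clustered data, and every object name and simulated value is hypothetical.

```{r}
# Illustrative sketch only; not the internal code of dcee().
set.seed(123)
n_id <- 20   # number of participants
n_dp <- 5    # decision points per participant
K <- 5       # number of folds
sim <- data.frame(
    id = rep(seq_len(n_id), each = n_dp),
    X  = rnorm(n_id * n_dp)
)
# Simulated distal outcome: one value per participant, repeated across rows
id_mean_X <- as.numeric(tapply(sim$X, sim$id, mean))
Y_by_id <- 1 + 2 * id_mean_X + rnorm(n_id)
sim$Y <- Y_by_id[sim$id]

# Assign folds at the participant level so all rows of a participant share a fold
fold_of_id <- sample(rep(seq_len(K), length.out = n_id))
sim$fold <- fold_of_id[sim$id]

# For each fold: fit on the other folds, predict on the held-out fold
sim$pred <- NA_real_
for (k in seq_len(K)) {
    held_out <- sim$fold == k
    fit_k <- lm(Y ~ X, data = sim[!held_out, ])
    sim$pred[held_out] <- predict(fit_k, newdata = sim[held_out, ])
}
head(sim)
```

Keeping all rows of a participant in the same fold prevents a participant's own outcome from informing the model that predicts for that participant.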

# Inspecting Stage-1 Fits

Setting `show_control_fit = TRUE` in `summary()` prints the control regression (i.e., Stage-1 nuisance) model fits, which is useful for diagnosing those models. For `lm`/`gam`, the output includes the regression summaries. For tree-based or SuperLearner fits, the original learner output is shown. For a closer look, one can manually inspect the `$fit$regfit_a0` and `$fit$regfit_a1` elements of the fitted object (e.g., `fit_lm$fit$regfit_a0`).

```{r}
summary(fit_lm, show_control_fit = TRUE)
```

# References
