---
title: "Parallel SEM Forests with the future Package"
author: "Andreas M. Brandmaier"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Parallel SEM Forests with the future Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette shows how to run SEM forests in parallel using the [`future`](https://future.futureverse.org/) package.
The future package provides a standardized way of evaluating R expressions asynchronously using the computing resources available to the user -- whether it is a single machine with multiple cores or a computing cluster.

We build on the latent growth curve model from the *Getting Started* vignette and use a `future::plan()` to distribute trees across the available CPU cores of a local machine.

## Packages

```{r}
library(semtree)
library(OpenMx)
library(future)
```

## Simulate the latent growth curve data

We reuse the linear latent growth curve example: each individual has five repeated
measures (`X1` to `X5`) and one dichotomous predictor (`P1`) that influences the
mean slope. Setting a seed keeps the simulation reproducible.

```{r}
set.seed(23)
N <- 1000
M <- 5
icept <- rnorm(N, 10, sd = 4)
slope <- rnorm(N, 3, sd = 1.2)
p1 <- sample(c(0, 1), size = N, replace = TRUE)
loadings <- 0:4
x <-
  (slope + p1 * 5) %*% t(loadings) +
  matrix(rep(icept, each = M), byrow = TRUE, ncol = M) +
  rnorm(N * M, sd = .08)
growth.data <- data.frame(x, factor(p1), factor(sample(c(0,1),N,TRUE)))
names(growth.data) <- c(paste0("X", 1:M), "P1","Noise")
```

## Specify and fit the SEM

The model is identical to the one used in the introductory vignette: a two-factor
latent growth curve with fixed factor loadings for the intercept and slope.

```{r}
manifests <- names(growth.data)[1:5]
growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification",
    type="RAM",
       manifestVars=manifests,
    latentVars=c("intercept","slope"),
    mxData(growth.data, type="raw"),
    # residual variances
    mxPath(
        from=manifests,
        arrows=2,
        free=TRUE,
        values = c(.1, .1, .1, .1, .1),
        labels=c("residual","residual","residual","residual","residual")
    ),
    # latent variances and covariance
    mxPath(
        from=c("intercept","slope"),
        arrows=2,
        connect="unique.pairs",
        free=TRUE,
        values=c(2, 0, 1),
        labels=c("vari", "cov", "vars")
    ),
    # intercept loadings
    mxPath(
        from="intercept",
        to=manifests,
        arrows=1,
        free=FALSE,
        values=c(1, 1, 1, 1, 1)
    ),
    # slope loadings
    mxPath(
        from="slope",
        to=manifests,
        arrows=1,
        free=FALSE,
        values=c(0, 1, 2, 3, 4)
    ),
    # manifest means
    mxPath(
        from="one",
        to=manifests,
        arrows=1,
        free=FALSE,
        values=c(0, 0, 0, 0, 0)
    ),
    # latent means
    mxPath(
        from="one",
        to=c("intercept", "slope"),
        arrows=1,
        free=TRUE,
        values=c(1, 1),
        labels=c("meani", "means")
    )
) # close model

growthCurveModel <- mxRun(growthCurveModel)
```

## Enable parallelism with `future`

`semtree::semforest()` uses the `future.apply` infrastructure internally, so registering a parallel plan automatically parallelizes tree construction. Here we choose a conservative number of workers to keep vignette builds light-weight, that is, a maximum of two workers (if available), and we restore the previous plan when we're done. 
Please see the documentation of `future` for other plans, such as `sequential` for non-parallel execution or `cluster` for using a computing cluster.

```{r}
old_plan <- future::plan()
on.exit(future::plan(old_plan), add = TRUE)

workers <- max(1, min(2, future::availableCores()))
future::plan(future::multisession, workers = workers)
```

## Grow the SEM forest in parallel

With the plan in place, we can grow a forest. We keep the number of trees modest
for the sake of a fast example; in practice, increase `num.trees` (e.g., 200--500)
for stable variable importance estimates.

```{r runforest, message=FALSE, warning=FALSE, echo=FALSE}
set.seed(99)
control <- semforest_control(num.trees = 20, 
                             mtry = 1)
forest <- semforest(
  model = growthCurveModel,
  data = growth.data,
  control = control
)
```

## Inspect variable importance

Because the latent growth data only include the grouping predictor `P1`,
importance results confirm that this variable drives heterogeneity in the model
parameters. Parallel evaluation speeds up the permutation step when multiple
predictors are present.

```{r}
vim <- varimp(forest)
print(vim, sort.values = TRUE)
plot(vim)
```

After the forest is built, `future::plan(old_plan)` (triggered via `on.exit`) returns the
session to its prior parallel configuration.
