---
title: "Cross-validation and Information Criteria in bigPLSR"
shorttitle: "Cross-validation and Information Criteria in bigPLSR"
author:
- name: "Frédéric Bertrand"
  affiliation:
  - Cedric, Cnam, Paris
  email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Cross-validation and Information Criteria in bigPLSR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup_ops, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/cv-ic-",
  fig.width = 7,
  fig.height = 5,
  dpi = 150,
  message = FALSE,
  warning = FALSE
)

LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE")
set.seed(2025)
```

## Overview

This vignette illustrates how to evaluate partial least squares (PLS) models
with cross-validation and information criteria, using the new helpers available
in `bigPLSR`, which include optional parallel execution.

We generate a small synthetic data set so the examples run quickly even when the
vignette is built during package installation.

```{r data}
library(bigPLSR)
n <- 120; p <- 8
X <- matrix(rnorm(n * p), n, p)
eta <- X[, 1] - 0.8 * X[, 2] + 0.5 * X[, 3]
y <- eta + rnorm(n, sd = 0.4)
```

## Cross-validation

The `pls_cross_validate()` function now accepts a `parallel` argument. Setting
`parallel = "future"` evaluates the folds concurrently by relying on the
[`future`](https://future.futureverse.org/) ecosystem. You are free to configure
any execution plan you like before calling the helper. Below we keep the
sequential default to avoid introducing run-time dependencies during the build
process.

```{r cv, eval=LOCAL, cache=TRUE}
cv_res <- pls_cross_validate(X, y, ncomp = 4, folds = 6,
                             metrics = c("rmse", "r2"),
                             parallel = "none")
head(cv_res$details)
```

Aggregating the metrics provides a quick overview of the predictive performance
per number of components:

```{r cv-summary, eval=LOCAL, cache=TRUE}
cv_res$summary
```
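
If you prefer a visual check, the summary table can be plotted directly. The
sketch below assumes it holds one row per component/metric combination with
columns named `ncomp`, `metric` and `value`; these column names are an
assumption, so adapt them to the actual structure of `cv_res$summary`.

```{r cv-plot, eval=FALSE}
## Hypothetical column names (ncomp, metric, value) -- adjust them to match
## the actual structure of cv_res$summary.
rmse_rows <- subset(cv_res$summary, metric == "rmse")
plot(rmse_rows$ncomp, rmse_rows$value, type = "b",
     xlab = "Number of components", ylab = "Cross-validated RMSE")
```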

The cross-validation table is convenient for downstream selection. For example,
we can pick the component count that minimises the RMSE:

```{r cv-select, eval=LOCAL, cache=TRUE}
pls_cv_select(cv_res, metric = "rmse")
```
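
A natural follow-up is to refit the model with the chosen number of components.
The sketch below assumes `pls_cv_select()` returns the selected component count
directly; if it returns a richer object instead, extract the count from it
first.

```{r cv-refit, eval=FALSE}
## Assumes pls_cv_select() returns the selected number of components.
best_k <- pls_cv_select(cv_res, metric = "rmse")
final_fit <- pls_fit(X, y, ncomp = best_k, scores = "r")
```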

## Information criteria

Information criteria complement cross-validation by trading off goodness of fit
with model complexity. The helper `pls_information_criteria()` computes the RSS,
RMSE, AIC and BIC across components:

```{r ic, eval=LOCAL, cache=TRUE}
fit <- pls_fit(X, y, ncomp = 4, scores = "r")
ic_tbl <- pls_information_criteria(fit, X, y)
ic_tbl
```
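
For reference, one common Gaussian-likelihood formulation of these criteria is
$\mathrm{AIC} = n \log(\mathrm{RSS}/n) + 2k$ and
$\mathrm{BIC} = n \log(\mathrm{RSS}/n) + k \log(n)$, where $k$ counts the
estimated parameters. The exact constants used by
`pls_information_criteria()` may differ, so treat the sketch below as an
illustration only.

```{r ic-manual, eval=FALSE}
## Illustrative only: a common Gaussian formulation of AIC/BIC from the RSS.
## The constants used internally by pls_information_criteria() may differ.
rss_to_ic <- function(rss, n, k) {
  c(aic = n * log(rss / n) + 2 * k,
    bic = n * log(rss / n) + k * log(n))
}
rss_to_ic(rss = 25, n = 120, k = 3)
```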

For convenience, the wrapper `pls_select_components()` selects the best number
of components according to the requested criteria:

```{r ic-select, eval=LOCAL, cache=TRUE}
pls_select_components(fit, X, y, criteria = c("aic", "bic"))
```

## Parallel execution with `future`

If you wish to parallelise cross-validation, configure a plan before calling the
helper. The example below assumes a multicore environment and therefore is not
run during vignette building:

```{r future-example, eval=FALSE}
future::plan(future::multisession, workers = 2)
cv_parallel <- pls_cross_validate(X, y, ncomp = 4, folds = 6,
                                  metrics = c("rmse", "mae"),
                                  parallel = "future",
                                  future_seed = TRUE)
future::plan(future::sequential)
```
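
When choosing a worker count, `future::availableCores()` gives a conservative
estimate that honours common environment limits (for example those imposed by
batch schedulers or CRAN checks), so it is a safer default than hard-coding a
number:

```{r future-workers, eval=FALSE}
## Leave one core free for the main R session.
workers <- max(1, future::availableCores() - 1)
future::plan(future::multisession, workers = workers)
```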

The `future_seed` argument ensures that random number generation, and hence the
fold-level results, remains reproducible even when multiple workers are used.
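
A quick way to convince yourself of this is to run the same cross-validation
twice from the same seed and compare the results. This check is a sketch and
assumes the returned summary is fully determined by the seed:

```{r future-seed-check, eval=FALSE}
set.seed(2025)
run1 <- pls_cross_validate(X, y, ncomp = 4, folds = 6, metrics = "rmse",
                           parallel = "future", future_seed = TRUE)
set.seed(2025)
run2 <- pls_cross_validate(X, y, ncomp = 4, folds = 6, metrics = "rmse",
                           parallel = "future", future_seed = TRUE)
identical(run1$summary, run2$summary)
```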

## Summary

The refreshed cross-validation workflow exposes a consistent interface for
sequential and parallel execution, while the information-criteria helpers offer
another perspective on component selection. The combination lets you
systematically tune your PLS models for both accuracy and parsimony.