---
title: "Getting started with bigPLScox"
shorttitle: "Getting started with bigPLScox"
author:
- name: "Frédéric Bertrand"
  affiliation:
  - Cedric, Cnam, Paris
  email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Getting started with bigPLScox}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/getting-started-",
  fig.width = 7,
  fig.height = 4.5,
  dpi = 150,
  message = FALSE,
  warning = FALSE
)

LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE")
```

# Why bigPLScox?

**bigPLScox** implements Partial Least Squares (PLS) extensions of the Cox
proportional hazards model that remain stable in high-dimensional survival
settings. The package now ships with three complementary engines:

* `coxgpls()` – the classical matrix-based estimator with optional sparse and
  grouped variants (e.g. `coxsgpls()`, `coxspls_sgpls()`).
* `big_pls_cox_fast()` – the new Armadillo backend that produces variance-one
  components and supports both dense matrices and `bigmemory::big.matrix`
  objects through the same interface.
* `big_pls_cox_gd()` – a stochastic gradient descent solver that streams data
  from disk and converges quickly even when only a subset of predictors remains
  active at each step.

Together with the deviance-residual solvers and cross-validation helpers, these
functions cover the most common modelling workflows. This vignette walks through
those workflows using the bundled allelotyping dataset.

# Loading the example data

```{r load-data, cache=TRUE, eval=LOCAL}
library(bigPLScox)

data(micro.censure)
data(Xmicro.censure_compl_imp)

Y_all <- micro.censure$survyear[1:80]
status_all <- micro.censure$DC[1:80]
X_all <- apply(
  as.matrix(Xmicro.censure_compl_imp),
  MARGIN = 2,
  FUN = as.numeric
)[1:80, ]

set.seed(2024)
train_id <- 1:60
test_id <- 61:80

X_train <- X_all[train_id, ]
X_test <- X_all[test_id, ]
Y_train <- Y_all[train_id]
Y_test <- Y_all[test_id]
status_train <- status_all[train_id]
status_test <- status_all[test_id]
```

The original factor-based design matrix is also available if you wish to
leverage the formula interface.

```{r original-design, cache=TRUE, eval=LOCAL}
X_train_raw <- Xmicro.censure_compl_imp[train_id, ]
X_test_raw <- Xmicro.censure_compl_imp[test_id, ]
```

# Inspecting deviance residuals

Deviance residuals can reveal problematic observations before any components are
extracted. The helper `computeDR()` exposes both an R and a C++ engine; the
latter is substantially faster and powers the new fast PLS routines.

```{r deviance-residuals, cache=TRUE, eval=LOCAL}
residuals_overview <- computeDR(Y_train, status_train, plot = TRUE)
eta_null <- rep(0, length(Y_train))
head(residuals_overview)

if (requireNamespace("bench", quietly = TRUE)) {
  benchmark_dr <- bench::mark(
    survival = computeDR(Y_train, status_train, engine = "survival"),
    cpp = computeDR(Y_train, status_train, engine = "cpp", eta = eta_null),
    iterations = 10,
    check = FALSE
  )
  benchmark_dr
}

all.equal(
  as.numeric(computeDR(Y_train, status_train, engine = "survival")),
  as.numeric(computeDR(Y_train, status_train, engine = "cpp", eta = eta_null)),
  tolerance = 1e-7
)
```

# Matrix-based PLS–Cox models

The matrix interface mimics `survival::coxph()` while augmenting the predictor
space with latent components. Most users start with `coxgpls()` and the
cross-validation helper `cv.coxgpls()`.

```{r fit-coxgpls, cache=TRUE, eval=LOCAL}
set.seed(123)
cox_pls_fit <- coxgpls(
  Xplan = X_train,
  time = Y_train,
  status = status_train,
  ncomp = 6,
  ind.block.x = c(3, 10, 20)
)
cox_pls_fit
```

The formula interface accepts data frames containing the predictors along with
survival outcomes.

```{r fit-formula, cache=TRUE, eval=LOCAL}
cox_pls_fit_formula <- coxgpls(
  ~ ., Y_train, status_train,
  ncomp = 6,
  ind.block.x = c(3, 10, 20),
  dataXplan = data.frame(X_train_raw)
)
cox_pls_fit_formula
```

## Cross-validation

Repeated cross-validation stabilises the choice of latent components. The
returned object records the optimal number of components and diagnostic curves.

```{r cv-coxgpls, cache=TRUE, eval=LOCAL}
set.seed(123456)
cv_results <- suppressWarnings(cv.coxgpls(
  list(x = X_train, time = Y_train, status = status_train),
  nt = 6,
  ind.block.x = c(3, 10, 20)
))
cv_results
```

Use the selected number of components to refit with the deviance-residual solver
for a robustness check.

```{r fit-coxgplsdr, cache=TRUE, eval=LOCAL}
set.seed(123456)
cox_pls_dr <- coxgplsDR(
  Xplan = X_train,
  time = Y_train,
  status = status_train,
  ncomp = cv_results$nt,
  ind.block.x = c(3, 10, 20)
)
cox_pls_dr
```

Sparse and structured sparse variants (`coxsgpls()`, `coxspls_sgpls()`) share the
same workflow with additional arguments that control the number of selected
predictors per component (`keepX`) or penalty strength.

# Fast solvers for medium-sized data

The new `big_pls_cox_fast()` routine exposes identical arguments for dense
matrices and `big.matrix` objects. On moderate data it serves as a drop-in
replacement for the original R implementation `big_pls_cox()`.

```{r fast-dense, cache=TRUE, eval=LOCAL}
fast_fit_dense <- big_pls_cox_fast(
  X = X_train,
  time = Y_train,
  status = status_train,
  ncomp = 4
)
summary(fast_fit_dense)
```

Predictions rely on the same `predict()` interface used by the classical
function.

```{r fast-dense-predict, cache=TRUE, eval=LOCAL}
linear_predictor_fast <- predict(fast_fit_dense, newdata = X_test, type = "link")
head(linear_predictor_fast)
```

For comparison, the legacy solver is still available. The C++ backend usually
reduces runtime by an order of magnitude while delivering components scaled to
variance one.

```{r classic-dense, cache=TRUE, eval=LOCAL}
legacy_fit_dense <- big_pls_cox(
  X = X_train,
  time = Y_train,
  status = status_train,
  ncomp = 4
)
legacy_fit_dense$cox_fit
```

## Fast PLS Cox vs gradient based PLS Cox

The package provides two families of PLS Cox estimators:

- `big_pls_cox_fast`  
  Exact PLS components computed in compiled code, with a single Cox fit at the end.
- `big_pls_cox_gd`  
  A gradient based Cox optimisation performed in the latent PLS space.

Both approaches share the same preprocessing (centering and scaling), the same PLS deflation scheme and the same prediction interface. The main difference is how the final Cox coefficients are obtained.

In the exact fast path, we build PLS components that are tailored to Cox score residuals, then fit a Cox model once on the resulting scores. In the gradient based path, we optimise the partial log-likelihood directly in the space spanned by the PLS scores, using one of several optimisers.

The code below illustrates a typical benchmark on a moderate survival data set. The design is split into a training set used to fit the models and a test set used to evaluate concordance.

```{r fastvsgd, cache=TRUE, eval=LOCAL}
set.seed(2024)
# Exact fast PLS Cox (dense)
fast_fit_dense <- big_pls_cox_fast(
  X     = X_train,
  time  = Y_train,
  status = status_train,
  ncomp = 4
)

# Exact fast PLS Cox (big.matrix)
if (requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)
  X_big_train <- bigmemory::as.big.matrix(X_train)
  X_big_test  <- bigmemory::as.big.matrix(X_test)

  fast_fit_big <- big_pls_cox_fast(
    X      = X_big_train,
    time   = Y_train,
    status = status_train,
    ncomp  = 4
  )

  # Gradient based fit in the same latent space
  gd_fit <- big_pls_cox_gd(
    X        = X_big_train,
    time     = Y_train,
    status   = status_train,
    ncomp    = 4,
    max_iter = 2000
  )

  risk_table <- data.frame(
    subject   = seq_along(test_id),
    fast_dense = predict(fast_fit_dense, newdata = X_test, type = "link"),
    fast_big   = predict(fast_fit_big,   newdata = X_big_test, type = "link"),
    gd         = predict(gd_fit,         newdata = X_big_test, type = "link")
  )

  concordances <- apply(
    risk_table[-1],
    2,
    function(lp) {
      survival::concordance(
        survival::Surv(Y_test, status_test) ~ lp
      )$concordance
    }
  )

  concordances
}
```

`big_pls_cox_gd()` is particularly useful for streamed data or when the number
of active predictors per component is restricted via `keepX`.

# Predictions and evaluation

All solvers return latent scores and loadings that can be reused for plotting or
external validation. Use `predict(..., type = "components")` to extract the
scores directly.

```{r prediction-components, cache=TRUE, eval=LOCAL}
if (exists("fast_fit_dense")) {
  component_scores <- predict(fast_fit_dense, newdata = X_test, type = "components")
  head(component_scores)
}
```

The concordance index provides a quick check of predictive ability on the test
set.

```{r concordance, cache=TRUE, eval=LOCAL}
if (exists("fast_fit_dense")) {
  concordance_fast <- survival::concordance(
    survival::Surv(Y_test, status_test) ~ linear_predictor_fast
  )$concordance
  concordance_fast
}
```

# DK-splines extension

For flexible baseline hazards the `coxDKgplsDR()` estimator augments the PLS
components with DK-splines. The interface mirrors the previous functions.

```{r dk-splines, cache=TRUE, eval=LOCAL}
cox_DKsplsDR_fit <- coxDKgplsDR(
  Xplan = X_train,
  time = Y_train,
  status = status_train,
  ncomp = 6,
  validation = "CV",
  ind.block.x = c(3, 10, 20),
  verbose = FALSE
)
cox_DKsplsDR_fit
```

Cross-validation is available for the DK-splines estimator as well.

```{r cv-dk-splines, cache=TRUE, eval=LOCAL}
set.seed(2468)
cv_coxDKgplsDR_res <- suppressWarnings(cv.coxDKgplsDR(
  list(x = X_train, time = Y_train, status = status_train),
  nt = 6,
  ind.block.x = c(3, 10, 20)
))
cv_coxDKgplsDR_res
```


# Next steps

* `vignette("bigPLScox", package = "bigPLScox")` dives deeper into the fast
  big-memory backends and shows how to reconcile the dense and streaming
  implementations.
* `vignette("bigPLScox-benchmarking", package = "bigPLScox")` demonstrates a
  reproducible benchmarking workflow that contrasts the classical, fast, and
  gradient-descent solvers against `survival::coxph()`.
* The reference documentation (`help("big_pls_cox_fast")`, etc.) details every
  argument and return value for the modelling functions discussed above.
