---
title: "Scoring via random forests"
vignette: >
  %\VignetteIndexEntry{Scoring via random forests}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---

```{r}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

> ⚠️ **work-in-progress**

```{r}
#| label: start
#| include: false

library(filtro)
library(dplyr)
library(modeldata)
```

We'll need to load a few packages: 

```{r}
#| label: setup
library(filtro)
library(dplyr)
library(modeldata)
```

## Score class objects

Predictor importance can be assessed using three different random forest models. They can be accessed via the following score class objects:

```{r}
#| eval: false
score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique
```

These models are powered by the following packages:

```{r}
#| echo: false
score_imp_rf@engine
score_imp_rf_conditional@engine
score_imp_rf_oblique@engine
```

Regarding score types:

- The {ranger} random forest computes the importance scores.

- The {partykit} conditional random forest computes the conditional importance scores.

- The {aorsf} oblique random forest computes the permutation importance scores.

## A scoring example — random forest 

The {modeldata} package contains a data set used to predict which cells in a high content screen were well segmented. It has 56 predictor columns, a factor variable `class` (the outcome), and a `case` column that marks each row as Train or Test.

Since `case` only indicates the Train/Test split and is not used for analysis, we set it to `NULL`. For efficiency, we also keep only the first 50 of the original 2019 observations.

```{r}
cells_subset <- modeldata::cells |> 
  # Use a small example for efficiency
  dplyr::slice(1:50)
cells_subset$case <- NULL

# cells_subset |> str() # Uncomment to see the structure of the data
```

First, we create a score class object to specify a {ranger} random forest, and then use the `fit()` method with the standard formula to compute the importance scores.

```{r}
# Specify random forest and fit score
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset, 
    seed = 42 
  )
```

The data frame of results can be accessed via `object@results`. 

```{r}
cells_imp_rf_res@results
```

A couple of notes here: 

The random forest filter, including all three types of random forests, supports

- regression tasks, and

- classification tasks.

If a score cannot be computed and `NA` is produced, a safe fallback value can be used so that the predictor is retained. It can be accessed via `object@fallback_value`. 

Larger values indicate more important predictors. 
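
The two notes above can be illustrated on toy data. This is only a sketch under stated assumptions: `toy_results` imitates `object@results` with assumed `predictor` and `score` columns, and `toy_fallback` is a hypothetical value playing the role of `object@fallback_value`. Neither is taken from a fitted score object.

```{r}
# Toy stand-in for `object@results`; the column names are assumed
toy_results <- data.frame(
  predictor = c("x1", "x2", "x3", "x4"),
  score = c(0.12, NA, 0.40, 0.03)
)

# Hypothetical fallback, playing the role of `object@fallback_value`
toy_fallback <- Inf

# Replace missing scores so the affected predictor is retained
toy_results$score[is.na(toy_results$score)] <- toy_fallback

# Larger is more important: sort in decreasing order and keep the top two
toy_ranked <- toy_results[order(toy_results$score, decreasing = TRUE), ]
head(toy_ranked, 2)
```

With a fallback at the top of the scale, the predictor whose score was `NA` is ranked first and therefore survives a "keep the top k" rule.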

For this specific filter, i.e., `score_imp_rf_*`, case weights are supported. 

## Hyperparameter tuning

As in {parsnip}, argument names are harmonized across engines. For example, the number of trees is set by `num.trees` in {ranger}, `ntree` in {partykit}, and `n_tree` in {aorsf}; all three are standardized to `trees`, so users only need to remember a single name. 

The same applies to the number of variables to split at each node, `mtry`, and the minimum node size for splitting, `min_n`.

```{r}
#| eval: false
# Set hyperparameters
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    trees = 100, 
    mtry = 2,
    min_n = 1
  )
```
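
To make the idea of harmonized names concrete, here is a minimal sketch of how a standardized `trees` argument could be renamed for each engine. The helper `translate_trees()` is hypothetical and written only for illustration; it is not the function {filtro} uses internally, and it covers only the three tree-count names listed above.

```{r}
# Engine-specific names for the number of trees, as listed above
tree_arg_names <- c(ranger = "num.trees", partykit = "ntree", aorsf = "n_tree")

# Hypothetical helper: rename a standardized `trees` argument for an engine
translate_trees <- function(args, engine) {
  engine <- match.arg(engine, names(tree_arg_names))
  is_trees <- names(args) == "trees"
  names(args)[is_trees] <- tree_arg_names[[engine]]
  args
}

# The same user-facing arguments, translated for two different engines
translate_trees(list(trees = 100, mtry = 2), engine = "ranger")
translate_trees(list(trees = 100, mtry = 2), engine = "aorsf")
```

Keeping the mapping in one named vector means the user-facing name stays fixed even if an engine is added or swapped.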

However, one argument is specific to {ranger}: for reproducibility, use the `seed` argument instead of the usual `set.seed()` call.

```{r}
#| eval: false
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    trees = 100,
    mtry = 2,
    min_n = 1, 
    seed = 42 # Set seed for reproducibility
  )
```

## Seamless argument support

If users supply {ranger} argument names, intentionally or not, the call still works because the necessary adjustments are handled for them. The following code chunk also produces a fitted score: 

```{r}
#| eval: false
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    num.trees = 100,
    mtry = 2,
    min.node.size = 1, 
    seed = 42 
  )
```

The same applies to {partykit}- and {aorsf}-specific arguments. 

## A scoring example — conditional random forest 

For the {partykit} conditional random forest, we again create a score class object to specify the model, then use the `fit()` method to compute the importance scores.

The data frame of results can be accessed via `object@results`. 

```{r}
# Set seed for reproducibility
set.seed(42)

# Specify conditional random forest and fit score
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 100)
cells_imp_rf_conditional_res@results
```

Note that when a predictor’s importance score is 0, `partykit::cforest()` may exclude its name from the output. In such cases, a score of 0 is assigned to the missing predictors.
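
The zero-filling step described above can be sketched with base R. The objects here are toy stand-ins chosen for illustration: `returned_imp` imitates an importance vector that omits two predictors, and `all_predictors` is the assumed full set of predictor names. This is not the package's internal code.

```{r}
# Assumed full set of predictor names in the model
all_predictors <- c("x1", "x2", "x3", "x4")

# Toy importance output that silently omits x2 and x4
returned_imp <- c(x1 = 0.8, x3 = 0.1)

# Start every predictor at 0, then overwrite the ones that were returned
filled_imp <- setNames(rep(0, length(all_predictors)), all_predictors)
filled_imp[names(returned_imp)] <- returned_imp
filled_imp
```

After this step every predictor has a score, so the results have one row per predictor regardless of what the engine returned.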

## A scoring example — oblique random forest 

For the {aorsf} oblique random forest, we again create a score class object to specify the model, then use the `fit()` method to compute the importance scores.

The data frame of results can be accessed via `object@results`. 

```{r}
# Set seed for reproducibility
set.seed(42)

# Specify oblique random forest and fit score
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, trees = 100, mtry = 2)
cells_imp_rf_oblique_res@results
```

## Available objects and engines

The score class objects for random forests, with their corresponding engines and supported tasks:

```{r}
#| echo: false
#| message: false
knitr::kable(
  data.frame(
    "object" = c("`score_imp_rf`", "`score_imp_rf_conditional`", "`score_imp_rf_oblique`"),
    "engine" = c("`ranger::ranger`", "`partykit::cforest`", "`aorsf::orsf`"),
    "task" = rep(c("regression, classification"), 3)
  )
)

```

