---
title: "GRM Forests for Robust DIF Detection"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{GRM Forests for Robust DIF Detection}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

# Introduction

GRM Forests extend GRM Trees by building ensembles of trees, which yields more robust variable importance measures than any single tree. This vignette covers:

* GRM Forest implementation

* Variable importance and variable importance plots

See the Getting Started vignette of the grmtree package for a more detailed walkthrough of the tree-based graded response model (GRMTree).

# Install and load required packages

To fit GRM Forests, install the following packages if they are not already installed.
```{r, eval=FALSE}
## Install packages from CRAN repository
install.packages(c("dplyr", "grmtree"))
```

Once installed, load the packages as follows:
```{r, message=FALSE, warning=FALSE}
library(dplyr)        # For data manipulation
library(grmtree)      # For tree-based GRM DIF Test
```

# Import and prepare the data

The data set used in this demonstration is the sample data shipped with the package.

```{r, message=FALSE}
## Load the data
data("grmtree_data", package = "grmtree")

## Take a glimpse at the data
glimpse(grmtree_data)

## Prepare the data
resp.data <- grmtree_data %>% 
  mutate(across(starts_with("MOS"), as.ordered)) %>% 
  mutate(across(c(sex, residency, depressed,
                  Education, job, smoker,
                  multimorbidity), as.factor)) 

## Explore the data
head(resp.data)

## Check the structure of the data
glimpse(resp.data)

## Combine the item responses (the first 8 columns) into a matrix outcome
resp.data$resp <- data.matrix(resp.data[, 1:8])
```
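
Before fitting, it is worth verifying that the matrix outcome was built as intended. Assuming the first eight columns of `grmtree_data` are the MOS items, a quick sanity check is:

```{r}
## Sanity check: 8 item columns, one row per respondent
dim(resp.data$resp)
colnames(resp.data$resp)
```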

# GRM Forests Implementation

## Define the forest control parameters
```{r}
## Get help on the control parameter
# ?grmforest.control

## GRMTree control parameters with Benjamini-Hochberg adjustment 
grm_control <- grmtree.control(
  minbucket = 350,
  p_adjust = "BH", alpha = 0.05)

## Define the forest control parameters
forest_control <- grmforest.control(
  n_tree = 3, # Number of trees (Reduced for vignette build time)
  sampling = "bootstrap",  # Bootstrap method; resampling also available
  sample_fraction = 0.632,
  mtry = sqrt(9),  # 3; typically the square root of the number of covariates (9 here)
  control = grm_control,
  remove_dead_trees = TRUE, # Remove any null GRMTree
  seed = 123
)
```

## Grow the GRM Forest

```{r, eval=FALSE}
## Fit the GRM forest
mos_forest <- grmforest(
  resp ~ sex + age + bmi + Education + 
  residency + depressed + job + multimorbidity + smoker,  
  data = resp.data,
  control = forest_control
)

## Get the summary of the fitted forest
summary(mos_forest)
print(mos_forest)

## Plot a tree in the forest
plot(mos_forest$trees[[1]])
```
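
Because `remove_dead_trees = TRUE` discards any tree that failed to fit, the forest may contain fewer than the `n_tree` trees requested. A quick check (assuming `mos_forest$trees` holds the list of fitted trees, as in the plot call above):

```{r, eval=FALSE}
## Count the trees retained after null fits were removed
length(mos_forest$trees)
```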

# Variable Importance

## Compute the variable importance of each covariate
```{r, eval=FALSE}
## Calculate the variable importance
importance <- varimp(mos_forest, seed = 123, verbose = TRUE)

## Print the result of the variable importance
print(importance)
```

Example output:
```
           age         smoker            bmi multimorbidity            sex 
     403.07554      220.37908       39.02621       37.00120       32.06389 
     Education      residency      depressed            job 
       0.00000        0.00000        0.00000        0.00000 
```
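
Raw importance scores are easier to compare on a relative scale. Assuming the result behaves like a named numeric vector (as the printed output above suggests), it can be rescaled to percent-of-maximum and sorted:

```{r, eval=FALSE}
## Relative importance: 100 = most important covariate
rel_imp <- sort(100 * importance / max(importance), decreasing = TRUE)
round(rel_imp, 1)
```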

## Plot the variable importance of each variable 

The `plot` method for `varimp` objects (`plot.varimp`) creates a bar plot of the variable importance scores, with options for both ggplot2 and base R graphics.
```{r, eval=FALSE}
## Plot the variable importance scores (ggplot is the default)
plot(importance)

## Plot only the top 5 most important variables
plot(importance, xlab = "", top_n = 5)

## Plot the base R version
plot(importance, use_ggplot = FALSE)

## Custom colors
plot(importance, col = c("green", "red"))

## Rename the variables, keeping the order from the variable importance result
names(importance) <- c("Age", "Smoking Status", "BMI",
                       "Multimorbidity", "Sex", "Education",
                       "Residency", "Depression", "Employment")

## Now create the plot with informative names
plot(importance)
```


# Conclusion

GRM Forests provide more stable DIF detection by aggregating across many trees. Key advantages include robust variable importance measures, reduced overfitting, and better handling of complex interactions.


