---
title: "Imputation Method based on xgboost"
author: "Birgit Karlhuber"
date: "2024-07-08"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Imputation Method based on xgboost}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height=4,
  fig.align = "center"
)
```


This vignette showcases the function `xgboostImpute()`, which imputes missing values using a gradient-boosted tree model fitted with `xgboost::xgboost()`.

## Data

The following example demonstrates the functionality of `xgboostImpute()` using a subset of `sleep`. The columns have been selected deliberately to include some interactions between the missing values.

```{r, message=FALSE}
library(VIM)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")] # dataset with missings
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)
str(dataset)
```


## Imputation

To invoke the imputation method, a formula specifies which variables are to be
imputed (left-hand side) and which variables serve as regressors (right-hand side). First, `Dream` will be imputed based on `BodyWgt`.

```{r, message=FALSE}
imp_xgboost <- xgboostImpute(formula = Dream ~ BodyWgt, data = dataset)
aggr(imp_xgboost, delimiter = "_imp")

```

The plot shows that all missing values of the variable `Dream` were imputed by the `xgboostImpute()` function. 
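
The same check can be done numerically. The following is a minimal sketch, not part of the original example: it rebuilds the two columns under the illustrative names `sleep_sub` and `sleep_imp` so that the chunk runs on its own, and it assumes that the indicator column is called `Dream_imp`, as in the plots of this vignette.

```{r, message=FALSE}
library(VIM)

# rebuild the two columns used above so that this chunk is self-contained
sleep_sub <- sleep[, c("Dream", "BodyWgt")]
sleep_sub$BodyWgt <- log(sleep_sub$BodyWgt)

sleep_imp <- xgboostImpute(Dream ~ BodyWgt, data = sleep_sub)

# number of missing values in Dream before and after the imputation
c(before = sum(is.na(sleep_sub$Dream)),
  after  = sum(is.na(sleep_imp$Dream)))

# the indicator column flags the rows that were filled in
table(sleep_imp$Dream_imp)
```

The number of `TRUE` entries in `Dream_imp` should match the number of missing values in the original `Dream` column.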


## Diagnosing the result

As we can see in the next plot, the correlation structure of `Dream` and
`BodyWgt` is preserved by the imputation method. 

```{r, fig.height=5}
imp_xgboost[, c("Dream", "BodyWgt", "Dream_imp")] |> 
  marginplot(delimiter = "_imp")

```


## Imputing multiple variables

To impute several variables at once, the formula can be specified with more than one column name on the
left-hand side.

```{r, message=FALSE}
imp_xgboost <- xgboostImpute(Dream + NonD + Span ~ BodyWgt, data = dataset)
aggr(imp_xgboost, delimiter = "_imp")

```


## Performance of method

To assess the performance of `xgboostImpute()`, the `iris` dataset is used. First, 48 values in the four numeric columns are randomly set to `NA`.

```{r}
library(reactable)

data(iris)
df <- iris
colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 48
y <- data.frame(row=sample(nrow(iris),size = nbr_missing,replace = FALSE),
                col=sample(rep(1:4,12)))
df[as.matrix(y)]<-NA

aggr(df)
sapply(df, function(x)sum(is.na(x)))
```
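
The assignment `df[as.matrix(y)] <- NA` relies on matrix indexing: every row of a two-column numeric matrix is read as one (row, column) position. The sketch below illustrates this on a small made-up data frame; `toy` and `idx` are hypothetical names and not part of the `iris` example.

```{r}
# toy data frame, made up purely for illustration
toy <- data.frame(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))

# each row of idx is one (row, column) position in toy
idx <- cbind(row = c(1, 3), col = c(2, 1))

# set exactly these two cells to NA: toy[1, "b"] and toy[3, "a"]
toy[idx] <- NA
toy
```

Only the two addressed cells are changed, which is why the true values can later be looked up in `iris` with the same index matrix.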

We can see that all four numeric variables contain missing values (12 each). Because the rows are sampled without replacement, each affected observation has exactly one missing value. In the next step, we impute all four variables at once, with `Species` serving as the regressor.

```{r, message=FALSE}
imp_xgboost <- xgboostImpute(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_xgboost, delimiter = "_imp")

```

The plot indicates that all missing values have been imputed by `xgboostImpute()`. The following table compares, for each variable, the first five true values with the corresponding imputed values (rounded to two decimals).

```{r, echo=FALSE, warning=FALSE}
results <- cbind("TRUE1" = as.numeric(iris[as.matrix(y[which(y$col==1),])]),
                 "IMPUTED1" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==1),])]),2),
                 "TRUE2" = as.numeric(iris[as.matrix(y[which(y$col==2),])]),
                 "IMPUTED2" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==2),])]),2),
                 "TRUE3" = as.numeric(iris[as.matrix(y[which(y$col==3),])]),
                 "IMPUTED3" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==3),])]),2),
                 "TRUE4" = as.numeric(iris[as.matrix(y[which(y$col==4),])]),
                 "IMPUTED4" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==4),])]),2))[1:5,]

reactable(results, columns = list(
    TRUE1 = colDef(name = "True"),
    IMPUTED1 = colDef(name = "Imputed"),
    TRUE2 = colDef(name = "True"),
    IMPUTED2 = colDef(name = "Imputed"),
    TRUE3 = colDef(name = "True"),
    IMPUTED3 = colDef(name = "Imputed"),
    TRUE4 = colDef(name = "True"),
    IMPUTED4 = colDef(name = "Imputed")
  ),
    columnGroups = list(
    colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")),
    colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")),
    colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")),
    colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4"))
  ),
  striped = TRUE,
  highlight = TRUE,
  bordered = TRUE
)

```
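
A table of five values per variable gives only a first impression. One possible way to summarise the deviation over all masked cells is a root mean squared error (RMSE) per column. The helper `imputation_rmse()` below is a hypothetical sketch and not part of VIM; it assumes numeric matrices of equal shape and a two-column (row, column) index matrix, and is demonstrated on made-up numbers.

```{r}
# hypothetical helper, not part of VIM:
# RMSE between true and imputed values at the masked cells, per column
imputation_rmse <- function(truth, imputed, idx) {
  err <- imputed[idx] - truth[idx]          # errors at the masked cells only
  tapply(err, idx[, 2], function(e) sqrt(mean(e^2)))
}

# toy matrices, made up for illustration
truth <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
imputed <- truth
imputed[1, 1] <- 2                          # off by +1 in column 1
imputed[3, 2] <- 4                          # off by -2 in column 2
idx <- cbind(row = c(1, 3), col = c(1, 2))

rmse <- imputation_rmse(truth, imputed, idx)
rmse
```

For the `iris` example, a call along the lines of `imputation_rmse(as.matrix(iris[, 1:4]), as.matrix(imp_xgboost[, 1:4]), as.matrix(y))` should give one value per measurement column, assuming the first four columns of `imp_xgboost` are still the numeric measurements.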
