---
title: "Use Case: Linear Regression with Outlier Correction and Bootstrap"
author: "Andreas Brandmaier"  
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Use Case: Linear Regression with Outlier Correction and Bootstrap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette demonstrates how to perform a robust linear regression
analysis on the Boston housing data.  We start by setting global chunk
options and loading the required packages.

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r libs}
library(reproducibleRchunks)
library(MASS)
```

# Load Data

We load the Boston housing data from the `MASS` package and display the
first rows to get an overview of the variables.

```{reproducibleR loadData}
boston <- MASS::Boston
knitr::kable(head(boston))
```

# Outlier Correction

Next, we remove observations with extremely high or low values of
`medv` using the IQR rule to reduce the influence of outliers.

```{reproducibleR outliers}
iqr_medv <- IQR(boston$medv)
q <- quantile(boston$medv, c(0.25, 0.75))
lower <- q[1] - 1.5 * iqr_medv
upper <- q[2] + 1.5 * iqr_medv
boston <- boston[boston$medv >= lower & boston$medv <= upper, ]
```

# Linear Regression

We then fit a simple linear regression predicting median house value
from the percentage of lower status population (`lstat`).

```{reproducibleR linreg}
model <- lm(medv ~ lstat, data = boston)
summary(model)
```

# Bootstrap Coefficients

To assess the variability of the coefficients we perform a simple
bootstrap.  Because analyses based on random numbers lead to different results across
runs, we set a random seed to make this step reproducible.  See
Brandmaier et&nbsp;al. for a detailed discussion of
reproducibility and random numbers
([https://qcmb.psychopen.eu/index.php/qcmb/article/view/3763/3763.html](https://qcmb.psychopen.eu/index.php/qcmb/article/view/3763/3763.html)).

```{reproducibleR bootstrap}
set.seed(123)
boot_coefs <- replicate(1000, {
  idx <- sample(nrow(boston), replace = TRUE)
  coef(lm(medv ~ lstat, data = boston[idx, ]))
})
boot_means <- rowMeans(boot_coefs)
boot_means
```

# Plot

Finally, we visualise the relation between `lstat` and `medv` together
with the fitted regression line. Since plots do return objects, they
are not tested for reproducibility.

```{reproducibleR plots}
plot(boston$lstat, boston$medv,
     xlab = "Lower status (% population)",
     ylab = "Median value of homes",
     main = "Boston Housing Data")
abline(model, col = "red")
```
