---
title: "Developing a Credit Scorecard"
author: "Shichen Xie, Michael Thomas"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Developing a Credit Scorecard}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Traditional Credit Scoring Using Logistic Regression

After installing scorecard via instructions in the [README](https://github.com/ShichenXie/scorecard#Installation) section, load the package into your environment.

```r
library(scorecard)
```

### Data Preparation

Let's use the *germancredit* dataset for the purposes of this demonstration.

```r
data("germancredit")
str(germancredit)
```

The `var_filter` function drops column variables that don't meet the thresholds for missing rate (> 95% by default), information value (IV) (< 0.02 by default), or identical value rate (> 95% by default). 

```r
dt_f <- var_filter(germancredit, y = "creditability")
```

### Split Data into Train / Test Sets

When building scorecard models, a subset of the observations should be held out from the data used to train the model (similar to most other traditional modeling approaches), and instead be apportioned to the *test* set. We can perform this sampling to create the *train* and *test* datasets using the `split_df` function.

```r
dt_list <- split_df(dt_f, y = "creditability", ratios = c(0.6, 0.4), seed = 30)
label_list <- lapply(dt_list, function(x) x$creditability)
```

### Weight-of-Evidence (WoE) binning

Weight-of-Evidence binning is a technique for binning both continuous and categorical independent variables in a way that provides the most robust bifurcation of the data against the dependent variable. This technique can be easily executed across all independent variables using the `woebin` function.

```r
bins <- woebin(dt_f, y = "creditability")
# woebin_plot(bins)
```

The user can also adjust bin breaks interactively by using the `woebin_adj` function.

```r
# breaks_adj <- woebin_adj(dt_f, y = "creditability", bins = bins)
```

Furthermore, the user can set the bin breaks manually via the `breaks_list = list()` argument in the `woebin` function. Note the use of *%,%* as a separator to create a single bin from two classes in a categorical independent variable.

```r
breaks_adj <- list(
  age.in.years = c(26, 35, 40),
  other.debtors.or.guarantors = c("none", "co-applicant%,%guarantor")
)

bins_adj <- woebin(dt_f, y = "creditability", breaks_list = breaks_adj)
```

Once your WoE bins are established for all desired independent variables, apply the binning logic to the training and test datasets.

```r
dt_woe_list <- lapply(dt_list, function(x) woebin_ply(x, bins_adj))
```

### Logistic Regression Example

Logistic regression can often be leveraged effectively to assist in building the scorecards.

```r
m1 <- glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)

# vif(m1, merge_coef = TRUE) # summary(m1)

# Select a formula-based model by AIC (or by LASSO for large dataset)
m_step <- step(m1, direction = "both", trace = FALSE)
m2 <- eval(m_step$call)

# vif(m2, merge_coef = TRUE) # summary(m2)
```

If oversampling is a concern, the following code chunk could be uncommented and run to help adjust for this issue.

```r
# Read documentation on handling oversampling (support.sas.com/kb/22/601.html)

# library(data.table)

# p1 <- 0.03 # bad probability in population 
# r1 <- 0.3 # bad probability in sample dataset

# dt_woe <- copy(dt_woe_list$train)[, weight := ifelse(creditability == 1, p1/r1, (1-p1)/(1-r1) )][]

# fmla <- as.formula(paste("creditability ~", paste(names(coef(m2))[-1], collapse = "+")))
# m3 <- glm(fmla, family = binomial(), data = dt_woe, weights = weight)
```

### Evaluating Model Performance Using KS & ROC

The `perf_eva` function provides model accuracy statistics (such as mse, rmse, logloss, r2, ks, auc, gini) and plots (such as ks, lift, gain, roc, lz, pr, f1, density).

```r
# First, get probabalistic predictions
pred_list <- lapply(dt_woe_list, function(x) predict(m2, x, type = 'response'))
# Then evaluate model accuracy  
perf <- perf_eva(pred = pred_list, label = label_list)
```

### Create Scorecard

Once the model has been selected, scorecards can be created via the `scorecard` function. Note that the default target points is 600, target odds is 1/19 and points to double the odds is 50. See `?scorecard` for more information on the function and its arguments.

The scorecard can then be applied to the original data using the `scorecard_ply` function. Lastly, a chart encompassing Population Stability Index (PSI) statistics can be rendered via the `perf_psi` function.

```r
# Build the card
card <- scorecard(bins_adj, m2)
# Obtain Credit Scores
score_list <- lapply(dt_list, function(x) scorecard_ply(x, card))
# Analyze the PSI
perf_psi(score = score_list, label = label_list)
```