---
title: "Motivation"
author: "Dr. Simon Müller"
date: "`r Sys.Date()`"
output:
  html_vignette:
    css: kable.css
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{Motivation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Objective {#objective}

### Define a Distance Measure for Numerical and Categorical Features {#define-a-distance-measure-for-numerical-and-categorical-features}

Let $x_i = \left(x_i^1,\cdots,x_i^p\right)^T$ and
$x_j = \left(x_j^1,\cdots,x_j^p\right)^T$ be the feature vectors of two
distinct patients $i$ and $j$. A first rough idea may be to calculate
the $L_1-$norm of this two feature vectors: $$
\|x_i - x_j\|_{L_1}=\frac{1}{p}\sum\limits_{k=1}^p|x_i^k - x_j^k|
$$

However, this naive approach arises from some problems:

1.  This measure is well-defined for numerical features but not for
    categorical features, such as gender (male/female). We need a method
    to define the distance for categorical variables.

2.  $|x_i^k - x_j^k|$ is not scale-invariant, meaning that if one
    changes the unit of measurement (e.g., from meters to centimeters),
    the contribution of this feature would increase by a factor of 100.
    Features with large values will dominate the distance.

3.  All features are treated equally, which may not reflect their actual
    importance. For example, in the case of lung cancer, the patient's
    smoking status (no/yes) seems more relevant than the city they live
    in.

This package addresses these issues by deriving a weighted distance from regression model coefficients, which naturally handles mixed feature types and reflects each feature's importance.

## Weighted Distance Measure {#weighted-distance-measure}

To address the challenges discussed earlier, we need to perform two
steps:

1.  Generalize the L1-distance to handle different variable types,
    including numerical and categorical features.

2.  Assign appropriate weights to each feature to:

    a\. eliminate the dependency on the scale of the variables,

    b.  account for varying importance across features, and
    c.  consider the impact size of each feature.

By incorporating these modifications, we arrive at the following
weighted distance measure:

$$
d(x_i, x_j) = \sum\limits_{k=1}^p |\alpha(x_i^k, x_j^k) d(x_i^k, x_j^k)|
$$

The weighted distance measure now includes weights
$\alpha(x_i^k, x_j^k)$, which are determined based on the training data.
In the next section, we will discuss how to obtain these weights and
effectively compute the weighted distance measure for mixed data types
and varying feature importance.

# Statistical Model {#statistical-model}

Let $\alpha(x_i^k, x_j^k)$ the weights and $d(x_i^k, x_j^k)$ the
distance for feature $k$ and observation $i$ and $j$. We will define
them as:

If feature $k$ is numerical, then

$$d(x_i^k, x_j^k) = x_i^k - x_j^k$$ and

$$\alpha(x_i^k, x_j^k) = \hat{\beta_k}$$

If feature $k$ is categorical, then

$$d(x_i^k, x_j^k) = 1 \text{ when } x_i^k = x_j^k \text{ else } 0$$ and
$$\alpha(x_i^k, x_j^k) = \hat{\beta_k}^i - \hat{\beta_k}^j,$$

where $\hat{\beta_k}^k$ are the coefficients of a regression model
(linear, logistic, or CPH model). The regression coefficients thus serve
as weights that reflect each feature's importance and scale.

# Random Forests {#random-forests}

## Proximity Measure {#proximity-measure}

Two observations $i$ and $j$ are considered more similar when the
fraction of trees in which patient $i$ and $j$ share the same terminal
node is close to one (Breiman, 2002).

$$d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T 1_{[x_i \text{ and } x_j \text{ share the same terminal node in tree } t]},$$
where $M$ is the number of trees that contain both observations and $T$
is the total number of trees.

A drawback of this measure is that the
decision is binary, meaning that potentially similar observations might
be counted as dissimilar. For example, suppose a final cut-off is
consistently made around age 58, and observation 1 has an age of 56
while observation 2 has an age of 60. In this case, the distance between
observation 1 and observation 2 would be the same as the distance
between observation 1 and an observation with an age of 80. This
limitation makes the proximity measure less sensitive to small
differences between observations, potentially affecting the overall
analysis of similarity.

## Depth Measure: A modified proximity measure {#depth-measure-a-modified-proximity-measure}

In contrast to the proximity measure, the depth measure takes into
account the number of edges between two observations instead of their
final nodes in each tree. This distance measure is averaged over all
trees and is defined as:

$$
d(x_i, x_j) = \frac{1}{M}\sum\limits_{t=1}^T g_{ij},
$$

where $M$ is the number of trees containing both observations, and
$g_{ij}$ is the number of edges between the end nodes of observation $i$
and $j$ in tree $t$. This measure considers the structure of the trees
and provides a more nuanced understanding of the similarity between
observations.

For more details and a thorough explanation of the depth measure, refer
to the publication by Englund and Verikas: "[A novel approach to
estimate proximity in a random forest: An exploratory
study.](https://doi.org/10.1016/j.eswa.2012.01.157)"

By accounting for tree structure, the depth measure provides a more
nuanced similarity assessment than the binary proximity measure.