---
title: "Pediatric cancers"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pediatric cancers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE}
library(dplyr)
library(ggplot2)
library(recipes)
library(scimo)

theme_set(theme_light())

data("pedcan_expression")
```

## Dataset

`pedcan_expression` contains the expression of 108 cell lines from 5 different pediatric cancers. Additionally, it includes information on the sex of the original donor, the type of cancer it represents, and whether it is a primary tumor or a metastasis.

```{r}
pedcan_expression
```

```{r}
count(pedcan_expression, disease, sort = TRUE)
```


## Dimension reduction

One approach to exploring this dataset is by performing PCA.

```{r}
rec_naive_pca <-
  recipe(~., data = pedcan_expression) %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors()) %>%
  prep()

rec_naive_pca %>%
  bake(new_data = NULL) %>%
  ggplot() +
  aes(x = PC1, y = PC2, color = disease) +
  geom_point()
```

To improve the appearance of PCA, one can precede it with a feature selection step based on the coefficient of variation. Here, `step_select_cv` keeps only one fourth of the original features.

```{r}
rec_cv_pca <-
  recipe(~., data = pedcan_expression) %>%
  step_select_cv(all_numeric_predictors(), prop_kept = 1 / 4) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors()) %>%
  prep()

rec_cv_pca %>%
  bake(new_data = NULL) %>%
  ggplot() +
  aes(x = PC1, y = PC2, color = disease) +
  geom_point()
```

The `tidy` method allows to see which features are kept.

```{r}
tidy(rec_cv_pca, 1)
```