---
title: "Advanced command line functions"
author: "Arnaud Wolfer"
date: "2019-10-03"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced command line functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

While the following vignette details all command line functions available in `santaR` for reference and potential development work, these are not expected to be used on a day-to-day basis (more details are available in each functions help page).

To analyse time-series data, refered to the [graphical user interface](santaR-GUI.pdf) or the [automated command line functions](automated-command-line.html) which implement these functions.

This vignette will detail the following underlying functions:

- Preparing input data
- DF search functions
- Fitting, Confidence bands and plotting
- P-values calculation


## Preparing input data

As `santaR` is an univariate approach, this vignette will use one variable from the acute inflammation dataset detailed in [How to prepare input data for santaR](prepare-input-data.html).

```{r}
library(santaR)

# data (keep the 3rd variable)
var1_data <- acuteInflammation$data[,3]
# metadata (common to all variables)
var1_meta <- acuteInflammation$meta

# 7 unique time-points
unique(var1_meta$time)
# 8 individuals
unique(var1_meta$ind)
# 2 groups
unique(var1_meta$group)

# 72 measurements for the given variable
var1_data
```


The first step is to generate the input matrix by converting the vector of observation (_y_ response for the variable at one time-point for one individual) into a matrix IND (_row_) x TIME (_column_) using `get_ind_time_matrix()`:

```{r, eval = FALSE}
var1_input  <- get_ind_time_matrix( Yi=var1_data, ind=var1_meta$ind, time=var1_meta$time)
var1_input
```
```{r, results = "asis", echo = FALSE}
var1_input  <- get_ind_time_matrix( Yi=var1_data, ind=var1_meta$ind, time=var1_meta$time)
pander::pandoc.table(var1_input)
```


In order to compare 2 groups, it is necessary to create a grouping matrix that list group membership for all individuals using `get_grouping()`:

```{r, eval = FALSE}
var1_group  <- get_grouping( ind=var1_meta$ind, group=var1_meta$group)
var1_group
```
```{r, results = "asis", echo = FALSE}
var1_group  <- get_grouping( ind=var1_meta$ind, group=var1_meta$group)
pander::pandoc.table(var1_group)
```


## DF search functions

The degree of freedom (_df_) is the parameter that controls how closely each individual's time-trajectory fit eachs data point, balancing the fitting of the raw data and the smoothing of measurements errors. An optimal _df_ value ensures that the spline is not overfitted or underfitted on the measurments. The degree of freedom should be established once for a dataset as it is a factor of 'complexity' of the time-trajectories under study, but does not change with different variables (same metadata, number of time-points,...)

Refer to [santaR theoretical background](theoretical-background.html) and [Selecting an optimal number of degrees of freedom](selecting-optimal-df.html) for more details on _df_ and an intuitive approach for its selection.

In order to assist in the selection of an optimal _df_ and visualise its impact, the following functions:

- extract the eigen-splines across all time-trajectories of all individuals and all variables
- provide metrics to select the _df_ that gives the best fit on the eigen-splines (as a guide value for the whole dataset)

First we extract the eigen-splines across the whole dataset using `get_eigen_spline()`:
```{r}
var_eigen  <- get_eigen_spline( inputData=acuteInflammation$data, ind=acuteInflammation$meta$ind, time=acuteInflammation$meta$time)
```
```{r, eval=FALSE}
# The projection of each eigen-spline at each time-point:
var_eigen$matrix
```
```{r, results = "asis", echo = FALSE}
pander::pandoc.table(var_eigen$matrix)
```
```{r}
# The variance explained by each eigen-spline
var_eigen$variance
# PCA summary
summary(var_eigen$model)
```

It is then possible to estimate the _df_ corresponding to the minimisation of a metric (penalised_residuals cross-validated, penalised_residuals general cross-validation, AIC, BIC or AICc) using `get_eigen_DF()`. The best _df_ can either be averaged over all eigen-splines `df` or weighted by the variance explained by each eigen-spline `wdf`:

```{r, eval = FALSE}
# The projection of each eigen-spline at each time-point:
get_eigen_DF(var_eigen)

# $df
```
```{r, results = "asis", echo = FALSE}
tmpDF <- get_eigen_DF(var_eigen)
pander::pandoc.table(tmpDF$df)
```
```{r, eval = FALSE}
# $wdf
```
```{r, results = "asis", echo = FALSE}
pander::pandoc.table(tmpDF$wdf)
```

The evolution of these metrics (_y_) depending on _df_ (_x_) can be plotted for each eigen-spline using `get_param_evolution()` and `plot_param_evolution()`:
```{r, fig.width = 7, fig.height = 7, dpi = 80}
library(gridExtra)

# generate all the parameter values across df
var_eigen_paramEvo  <- get_param_evolution(var_eigen, step=0.1)

# plot the metric evolution
plot(arrangeGrob(grobs=plot_param_evolution(var_eigen_paramEvo, scaled=FALSE)))

# Scale the metrics for each eigen-spline between 0 and 1
plot(arrangeGrob(grobs=plot_param_evolution(var_eigen_paramEvo, scaled=TRUE)))
```

As we can see, the recommended _df_ can vary widely depending on the metric selected. `get_eigen_DFoverlay_list()` will plot all eigen-projections (green points), a manually selected _df_ (blue line) and automatically fitted _df_ (red line), while grey lines represent splines at 0.2 _df_ intervals (default value):
```{r, fig.width = 8, fig.height =8, dpi = 90}
library(gridExtra)

# plot all eigen-projections
plot(arrangeGrob(grobs=get_eigen_DFoverlay_list(var_eigen, manualDf = 5)))
```
It should be noted that _df=2_ corresponds to a linear model. _df=number(time-points)_ corresponds to a curve that will go through all points (overfitted).

A final factor to take into account is the number of points needed for each individuals depending on the _df_ selected:
  
  - the minimum number of time-points needed is the _df_
  - if for example _df=10_, all individuals that have less than 10 time-points have to be rejected
  
Using `plot_nbTP_histogram()` we can visualise how many samples would have to be rejected for a given _df_. Due to the lack of missing values in the `acuteInflammation` dataset, the plots is not very informative.
```{r, fig.width = 7, fig.height = 5, dpi = 80}
# dfCutOff controls which cut-off is to be applied
plot_nbTP_histogram(var_eigen, dfCutOff=5)
```

As it does not seem to be possible to automatically select the degree of freedom, a choice based on visualisation of the splines while being careful of overfitting, keeping in mind the 'expected' evolution of the underlying process is the most sensible approach.


## Fitting, Confidence bands and plotting

Fitting of each individual and group mean curves are achieved with `santaR_fit()` to generate a `SANTAObj` that is then used for processing:
```{r}
var1 <- santaR_fit(var1_input, df=5, groupin=var1_group)

# it is possible to access the SANTAObj structure, which will be filled in the following steps
var1$properties
var1$general
var1$groups$Group1
```

Confidence bands on the group mean curves can be calculated by bootstrapping using `santaR_CBand()`:
```{r}
var1 <- santaR_CBand(var1)
```

Plot is achieved using `santaR_plot()`, for more details see [Plotting options](plotting-options.html):
```{r, fig.width = 7, fig.height = 5, dpi = 96}
santaR_plot(var1)
```


## P-values calculation

The _p_-values are calculated by the comparison of distance between group mean curves by random sampling of individuals. Due to the stochastic nature of the test, the _p_-value obtained can slighlty vary depending on the random draw. This can be compounded by using the lower and upper confidence range on the _p_-value that is estimated at the same time.

`santaR_pvalue_dist()` will calculate the significance of the difference between two groups:
```{r}
var1 <- santaR_pvalue_dist(var1)
# p-value
var1$general$pval.dist
# lower p-value confidence range
var1$general$pval.dist.l
# upper p-value confidence range
var1$general$pval.dist.u
# curve correlation coefficiant
var1$general$pval.curveCorr
```


## See Also

* [Getting Started with santaR](getting-started.html)
* [How to prepare input data for santaR](prepare-input-data.html)
* [santaR theoretical background](theoretical-background.html)
* [Graphical user interface use](santaR-GUI.pdf)
* [Automated command line analysis](automated-command-line.html)
* [Plotting options](plotting-options.html)
* [Selecting an optimal number of degrees of freedom](selecting-optimal-df.html)