---
title: "How to prepare input data for santaR"
author: "Arnaud Wolfer"
date: "2019-10-03"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to prepare input data for santaR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The `santaR` package is designed for the detection of significantly altered time trajectories between study groups, in short time-series. It is robust to missing values and noisy measurements without requiring synchronisation in time.

This vignette will:

* Detail the input format expected by the package
* Present the provided example dataset _'acuteInflammation'_
* Save _'acuteInflammation'_ as `.csv` and `.RData` files to be used as input for the graphical interface tutorial.


## Data format

In short, for a given variable, each measurement (observation) is one element of a vector.

If more than one variable has been measured at a given time, multiple measurement columns can be provided in a `data.frame` (`data`) with observations as rows and variables as columns.

For each data point (row), the following metadata are required, either as vectors or as columns of a `data.frame` (`metadata`):

* `time`, the time at which the observation was taken.
* `ind`, an identifier indicating which subject (individual) is associated with the observation.

Optionally:

* `group`, an identifier indicating to which study group the observation belongs.

> All observations of a given individual need to be assigned to the same group. If 2 groups exist, significantly altered time trajectories can be identified. If no group or more than 2 groups are provided, the trajectories can be plotted but significance cannot be calculated.

`data` and `metadata` information can be stored as vectors, or in one or two separate `data.frame`s. If a data point is not available (no value for any of the variables), the row should be discarded.
If only some of the variable measurements are missing for a given time-point, the missing values can be replaced by `NaN`.
Do not impute data, as the package is explicitly designed to be robust to missing values.
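
Discarding empty observations can be sketched in base R as follows. This is only an illustration under stated assumptions: the values are made up, and the names `inData`, `inMeta` and `keep` are chosen for the example rather than required by `santaR`.

```{r, eval = FALSE}
# Illustrative measurements: the third row has no value for any variable
inData <- data.frame(variable1 = c(1, 3.5, NaN, 9.5),
                     variable2 = c(110.2, NaN, NaN, 132.0))
inMeta <- data.frame(ind   = c('ind_1', 'ind_1', 'ind_2', 'ind_2'),
                     time  = c(0, 5, 0, 10),
                     group = c('group_A', 'group_A', 'group_B', 'group_B'),
                     stringsAsFactors = FALSE)

# A row is kept when at least one variable has been measured
# (is.na() is TRUE for both NA and NaN)
keep <- rowSums(!is.na(inData)) > 0

# Discard empty rows from data and metadata together so they stay aligned
inData <- inData[keep, , drop = FALSE]
inMeta <- inMeta[keep, , drop = FALSE]
```

Rows with only some values missing (here the second row) are kept untouched, with `NaN` left in place.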

Here is an example of `5` observations of `2` variables, taken on `3` individuals separated into `2` groups and covering `3` time-points:
```{r, eval = FALSE}
# Metadata
```
```{r, results = "asis", echo = FALSE}
ind   <- c('ind_1','ind_1','ind_2','ind_2','ind_3')
time  <- c(0, 5, 0, 10, 5)
group <- c('group_A','group_A','group_B','group_B','group_A')
outputMeta <- data.frame(ind, time, group, stringsAsFactors = FALSE)
pander::pandoc.table(outputMeta)
```
```{r, eval = FALSE}
# Data
```
```{r, results = "asis", echo = FALSE}
variable1  <- c(1,3.5, 4,9.5,5)
variable2  <- c(110.2, NaN, 79.1, 132.0, 528.3)
outputData <- data.frame(variable1, variable2, stringsAsFactors = FALSE)
pander::pandoc.table(outputData)
```
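
The two tables above could be built and checked along the following lines. This is a minimal sketch in base R: the names `inMeta`, `inData` and `groupsPerInd` are illustrative, and the checks simply restate the constraints described in this section rather than any validation performed by `santaR` itself.

```{r, eval = FALSE}
# Metadata: one entry per observation
inMeta <- data.frame(ind   = c('ind_1', 'ind_1', 'ind_2', 'ind_2', 'ind_3'),
                     time  = c(0, 5, 0, 10, 5),
                     group = c('group_A', 'group_A', 'group_B', 'group_B', 'group_A'),
                     stringsAsFactors = FALSE)

# Data: one column per variable, NaN marks a missing measurement
inData <- data.frame(variable1 = c(1, 3.5, 4, 9.5, 5),
                     variable2 = c(110.2, NaN, 79.1, 132.0, 528.3))

# Data and metadata must describe the same observations
stopifnot(nrow(inMeta) == nrow(inData))

# Each individual must belong to a single group
groupsPerInd <- tapply(inMeta$group, inMeta$ind, function(g) length(unique(g)))
stopifnot(all(groupsPerInd == 1))

# Summary of the design
nInd   <- length(unique(inMeta$ind))    # 3 individuals
nGroup <- length(unique(inMeta$group))  # 2 groups
nTime  <- length(unique(inMeta$time))   # 3 time-points
```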

## Introducing the dataset _'acuteInflammation'_

The `santaR` package is designed for the analysis of short noisy time-series as produced in most _'-omics'_ platforms, an example of which is provided.
This dataset, referred to as `acuteInflammation`, contains the concentrations of 22 mediators of inflammation over an episode of acute inflammation. The mediators have been measured at 7 time-points on 8 subjects, and concentration values have been unit-variance scaled for each variable.

`acuteInflammation` is stored as two `data.frame`s: `meta` for the metadata of the 56 observations, and `data` for the measurements of the 22 variables:
```{r, eval = FALSE}
library(santaR)

## Metadata
# number of rows
nrow(acuteInflammation$meta)
# number of columns
ncol(acuteInflammation$meta)
# a subset
acuteInflammation$meta[12:20,]
```
```{r, results = "asis", echo = FALSE}
library(santaR)
nrow(acuteInflammation$meta)
```

```{r, results = "asis", echo = FALSE}
library(santaR)
ncol(acuteInflammation$meta)
```

```{r, results = "asis", echo = FALSE}
library(santaR)
pander::pandoc.table(acuteInflammation$meta[12:20,])
```
```{r, eval = FALSE}
## Data
# number of rows
nrow(acuteInflammation$data)
# number of columns
ncol(acuteInflammation$data)
# a subset
acuteInflammation$data[12:20,1:4]
```
```{r, results = "asis", echo = FALSE}
library(santaR)
nrow(acuteInflammation$data)
```

```{r, results = "asis", echo = FALSE}
library(santaR)
ncol(acuteInflammation$data)
```
```{r, results = "asis", echo = FALSE}
library(santaR)
pander::pandoc.table(acuteInflammation$data[12:20,1:4])
```


## Preparing the csv input for the graphical user interface

While the command line functions accept `data.frame`s and vectors as input, the graphical user interface reads a `.csv` file.

By concatenating `acuteInflammation`'s `meta` and `data` tables and saving them in a `.csv` file, we can prepare the input dataset for the graphical user interface tutorial:

```{r, eval = FALSE}
library(santaR)

# Concatenate
outputTable <- cbind(acuteInflammation$meta, acuteInflammation$data)

# Save to disk
outputPath <- file.path('path_to_my_output_folder', 'acuteInflammation_GUI_demo.csv')
write.csv(outputTable, file = outputPath, row.names = FALSE)
```
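
Before loading the file in the graphical user interface, it can be worth reading it back to confirm that its layout is as expected. The sketch below is self-contained and uses small made-up tables and a temporary file instead of `acuteInflammation`; the names `inMeta`, `inData` and `reloaded` are illustrative.

```{r, eval = FALSE}
# Small illustrative tables standing in for acuteInflammation$meta and $data
inMeta <- data.frame(ind = c('ind_1', 'ind_2'), time = c(0, 5),
                     group = c('group_A', 'group_B'), stringsAsFactors = FALSE)
inData <- data.frame(variable1 = c(1, 3.5), variable2 = c(110.2, NaN))

# Write the concatenated table to a temporary file
outputPath <- tempfile(fileext = '.csv')
write.csv(cbind(inMeta, inData), file = outputPath, row.names = FALSE)

# Read it back and check that the layout is preserved:
# metadata columns first, then one column per variable, one row per observation
reloaded <- read.csv(outputPath, stringsAsFactors = FALSE)
stopifnot(identical(colnames(reloaded), c(colnames(inMeta), colnames(inData))))
stopifnot(nrow(reloaded) == nrow(inMeta))

# The missing measurement is still flagged as missing after the round trip
stopifnot(is.na(reloaded$variable2[2]))
```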

It is also possible to provide the data directly as 2 `data.frame`s stored in a `.RData` file, with the measurements in a `data.frame` named `inData` and the metadata in a `data.frame` named `inMeta`:

```{r, eval = FALSE}
library(santaR)

# Rename datasets
inMeta <- acuteInflammation$meta
inData <- acuteInflammation$data

# Save to disk
outputPath <- file.path('path_to_my_output_folder', 'acuteInflammation_GUI_demo.rdata')
save(inMeta, inData, file = outputPath, compress = TRUE)
```
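
Since the objects are expected under the names `inData` and `inMeta`, one way to verify a saved file is to load it into a separate environment and inspect what it contains. The following is a minimal sketch using made-up tables and a temporary file; `checkEnv` and `loadedNames` are illustrative names.

```{r, eval = FALSE}
# Small illustrative tables standing in for the acuteInflammation tables
inMeta <- data.frame(ind = c('ind_1', 'ind_2'), time = c(0, 5),
                     group = c('group_A', 'group_B'), stringsAsFactors = FALSE)
inData <- data.frame(variable1 = c(1, 3.5), variable2 = c(110.2, NaN))

# Save both tables to a temporary file
outputPath <- tempfile(fileext = '.RData')
save(inMeta, inData, file = outputPath, compress = TRUE)

# Load into a fresh environment to inspect the file without
# overwriting objects in the current workspace
checkEnv    <- new.env()
loadedNames <- load(outputPath, envir = checkEnv)

# The file should contain exactly the two expected data.frames
stopifnot(setequal(loadedNames, c('inMeta', 'inData')))
stopifnot(is.data.frame(checkEnv$inData), is.data.frame(checkEnv$inMeta))
```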


## See Also

* [Getting Started with santaR](getting-started.html)
* [santaR theoretical background](theoretical-background.html)
* [Graphical user interface use](santaR-GUI.pdf)
* [Automated command line analysis](automated-command-line.html)
* [Plotting options](plotting-options.html)
* [Selecting an optimal number of degrees of freedom](selecting-optimal-df.html)
* [Advanced command line options](advanced-command-line-functions.html)