---
title: "sfn_data classes"
author: "Victor Granda (Sapfluxnet Team)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{sfn_data classes}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# `sfn_data`

`sfn_data` is an S4 class designed for store and interact with sap flow data
at individual plant level (not raw data), primarily from Sapfluxnet project
sites data and metadata.

## S4 Slots

`sfn_data` class has twelve different slots:

  1. **sapf_data**: Tibble containing the sap flow data, each column representing
     an individual tree, **without** any TIMESTAMP variable.
  
  2. **env_data**: Tibble containing the environmental data, each column an
     environmental variable, **without** any TIMESTAMP variable. It must have
     the same `nrow` thet `sapf_data`, in order to be able to combine both in
     further analyses or aggregations.
  
  3. **sapf_flags**: Tibble with the same dimensions as `sapf_data`, containing
     the flags (special remarks indicating possible outliers or any annotation
     of interest) for each observation in `sapf_data`.
  
  4. **env_flags**: Tibble with the same dimensions as `env_data`, containing
     the flags for each observation in `env_data`.
  
  5. **si_code**: Character vector of length 1 with the site code. Useful for
     further analyses or aggregations in order to identify the site when working
     with more then one.
  
  6. **timestamp**: POSIXct vector of length equal to `nrow(sapf_data)` with the
     timestamp values.
  
  7. **solar_timestamp**: POSIXct vector of length equal to `nrow(sapf_data)`
     with the apparent solar timestamp.
  
  8. **site_md**: Tibble with the site metadata. See `sfn_vars_to_filter()` for
     a list of possible metadata variables. This variables are not mandatory,
     and new ones can be added.
  
  9. **stand_md**: Tibble with the stand metadata. See `sfn_vars_to_filter()`
     for a list of possible metadata variables. This variables are not
     mandatory, and new ones can be added.
  
  10. **species_md**: Tibble with the species metadata. See
      `sfn_vars_to_filter()` for a list of possible metadata variables. This
      variables are not mandatory, and new ones can be added.
  
  11. **plant_md**: Tibble with the plant metadata. See `sfn_vars_to_filter()`
      for a list of possible metadata variables. This variables are not
      mandatory, and new ones can be added.
  
  12. **env_md**: Tibble with the environmental metadata. See
      `sfn_vars_to_filter()` for a list of possible metadata variables. This
      variables are not mandatory, and new ones can be added.

### `sfn_data` design

The schematics of the class are summarised in Fig 1.

![sfn_data schematics](resources/schematics.svg.png)

This design have some characteristics:

  + **Dimensions**: Dimensions must comply:  
  
    `nrow(sapf_data) == nrow(env_data) == nrow(sapf_flags) == nrow(env_flags) == length(timestamp) == length(solar_timestamp)`  
    
    `ncol(sapf_data) == ncol(sapf_flags)`  
    
    `ncol(env_data) == ncol(env_flags)`  
    
    In the case of differences in the dimensions between environmental and
    sap flow data, NAs must be added in the corresponding places when equalising
    both TIMESTAMPs (i.e. if env data is longer then sapf data, rows are added
    to the later with NAs to make them equal, and flags ("NA_ADDED") are raised
    to indicate the adding).

  + **si_code slot**: This slot indicates the site. This is useful for
    identifying the site in the metrics functions results.
  
  + **Compartmentalization**: `timestamp` and `solar_timestamp` are isolated
    from data (sapf and env) in their own slot. This made easier subsetting the
    object and open the possibility of working with the timestamps only.
  
  + **Flags**: Two slots are used for storing data flags. As the flag slots
    have the same dimensions as the corresponding dataset slot, selection of
    flagged data is straightforward:  
    
    `sapf_data[flag_data == "flag", ]`


## Methods

Some methods are included in the class:

  + `show`: calling a sfn_data object shows info about it
  
```{r show}
library(sapfluxnetr)
data('ARG_TRE', package = 'sapfluxnetr')
ARG_TRE
```
  
  + `get methods`: methods to get any slot outside the object. In the case
    of sapf and env data, TIMESTAMP is added (based on the `solar` argument) in
    order to obtain fully functional datasets. See `?sfn_get_methods` for more
    details.
    
```{r get_methods}
get_sapf_data(ARG_TRE, solar = FALSE)
get_sapf_data(ARG_TRE, solar = TRUE)
get_env_data(ARG_TRE) # solar default is FALSE
get_sapf_flags(ARG_TRE) # solar default is FALSE
get_env_flags(ARG_TRE) # solar default is FALSE
get_si_code(ARG_TRE)
get_timestamp(ARG_TRE)[1:10]
get_solar_timestamp(ARG_TRE)[1:10]
get_site_md(ARG_TRE)
get_stand_md(ARG_TRE)
get_species_md(ARG_TRE)
get_plant_md(ARG_TRE)
get_env_md(ARG_TRE)
```
    
  + `assignation`: methods for `"<-"` are implemented, allowing transformation
    of the slots. This is used internally in some functions but is not
    recommended except for updating metadata slots. In data slots this
    changes are not reflected in the flags, so the reproducibility and
    traceability will be lost if used without care.
    See `?sfn_replacement_methods` for more details.
    
```{r assignation}
# extraction and modification
foo_site_md <- get_site_md(ARG_TRE)
foo_site_md[['si_biome']]
foo_site_md[['si_biome']] <- 'Temperate forest'
# assignation
get_site_md(ARG_TRE) <- foo_site_md
# check it worked
get_site_md(ARG_TRE)[['si_biome']]
```
  
  + `validation`: a validation method is implemented in order to avoid creation
    of invalid sfn_data objects (i.e objects with different data dimensions,
    objects without metadata...)
    
```{r validation, error=TRUE, purl=FALSE, collapse=FALSE}
# get sap flow data
foo_bad_sapf <- get_sapf_data(ARG_TRE)
# pull a row, now it has diferent dimensions then
foo_bad_sapf <- foo_bad_sapf[-1,]
# try to assign the incorrect data fails
get_sapf_data(ARG_TRE) <- foo_bad_sapf[,-1] # ERROR
# try to build a new object also fails
sfn_data(
  sapf_data = foo_bad_sapf[,-1], # remember to remove timestamp column
  env_data = get_env_data(ARG_TRE)[,-1],
  sapf_flags = get_env_flags(ARG_TRE)[,-1],
  env_flags = get_env_flags(ARG_TRE)[,-1],
  si_code = get_si_code(ARG_TRE),
  timestamp = get_timestamp(ARG_TRE),
  solar_timestamp = get_solar_timestamp(ARG_TRE),
  site_md = get_site_md(ARG_TRE),
  stand_md = get_stand_md(ARG_TRE),
  species_md = get_species_md(ARG_TRE),
  plant_md = get_plant_md(ARG_TRE),
  env_md = get_env_md(ARG_TRE)
)
```

## Utilities

`sapfluxnetr` package offers some utilities to visualize and work with sfn_data
objects:

### `sfn_plot`

This function allows to plot `sfn_data` objects. See `?sfn_plot` for more
details.
  
```{r sfn_plot, fig.width=6}
library(ggplot2)

sfn_plot(ARG_TRE, type = 'env') +
  facet_wrap(~ Variable, ncol = 3, scales = 'free_y') +
  theme(legend.position = 'none')

sfn_plot(ARG_TRE, formula_env = ~ vpd) +
  theme(legend.position = 'none')
```
  
### `sfn_filter`

This function emulates `filter` function from `dplyr` package for `sfn_data`
objects. Useful to filter by some specific timestamp. Be advised, using this
function to filter by sap flow or environmental variables can create TIMESTAMP
gaps. See `sfn_filter` for more details.
    
```{r sfn_filter}
library(lubridate)
library(dplyr)

# get only the values for november
sfn_filter(ARG_TRE, month(TIMESTAMP) == 11)
```
    
### `sfn_mutate`

This function allows mutation of data variables inside the `sfn_data` object.
Useful when you need to transform a variable to another units or similar. A flag
('USER_MODF') will be added to all values in the mutated variable. See
`sfn_mutate` for more details.

  > At this moment, mutate does not allows creating new variables, only mutate
    existing variables
    
```{r sfn_mutate}
# transform ws from m/s to km/h
foo_mutated <- sfn_mutate(ARG_TRE, ws = ws * 3600/1000)
get_env_data(foo_mutated)[['ws']][1:10]
```
  
### `sfn_mutate_at`

This function mutates all variables declared with the function provided. Useful
when you need to conditionally transform the data, i.e. converting to NA sap flow
values when an environmental variable exceeds some threshold. See
`sfn_mutate_at` for more details.
    
```{r sfn_mutate_at}
foo_mutated_2 <- sfn_mutate_at(
  ARG_TRE,
  vars(one_of(names(get_sapf_data(ARG_TRE)[,-1]))),
  list(~ case_when(
    ws > 25 ~ NA_real_,
    TRUE ~ .
  ))
)

# see the difference between ARG_TRE and foo_mutated_2
get_sapf_data(ARG_TRE)
get_sapf_data(foo_mutated_2)
```

### `*_metrics` functions

Family of functions to aggregate and summarise the site data. See `?metrics` for
more details.
    
```{r metrics}
foo_daily <- daily_metrics(ARG_TRE)
foo_daily[['sapf']][['sapf_gen']]
```

For full control of metrics and custom aggregations see
`vignette('custom-aggregation', package = 'sapfluxnetr')`

# `sfn_data_multi`

`sfn_data_multi` is an S4 class designed to store multiple `sfn_data` objects.
It inherits from list so, in a nutshell, `sfn_data_multi` is a list of
`sfn_data` objects.

## Methods

`sfn_data_multi` has the following methods declared.
  
  + `show`: it shows the number, codes and combined timestamp span of sites in the
    multi object.
  
```{r show_multi}
# creating a sfn_data_multi object
data(ARG_MAZ, package = 'sapfluxnetr')
data(AUS_CAN_ST2_MIX, package = 'sapfluxnetr')
multi_sfn <- sfn_data_multi(ARG_TRE, ARG_MAZ, AUS_CAN_ST2_MIX)

# show method
multi_sfn
```

  + `get` methods: they work the same as in the `sfn_data` objects,
    except they return a list, with the corresponding data for the sites in the
    `sfn_data_multi` object as elements.

```{r get_multi}
# get sap flow data
get_sapf_data(multi_sfn)
# get plant metadata
get_plant_md(multi_sfn)
# with metadata, we can collapse
get_plant_md(multi_sfn, collapse = TRUE)
```


## Utilities

All the utilities thet exists for `sfn_data` work for `sfn_data_multi` objects,
executing the function for all the sites contained in the `sfn_data_multi`
object:

### `sfn_plot`

```{r plot_multi, fig.show='hold', fig.width=3.4}
multi_plot <- sfn_plot(multi_sfn, formula = ~ vpd)
multi_plot[['ARG_TRE']] + theme(legend.position = 'none')
multi_plot[['AUS_CAN_ST2_MIX']] + theme(legend.position = 'none')
```


### `sfn_filter`

```{r sfn_filter_multi}
multi_filtered <- sfn_filter(multi_sfn, month(TIMESTAMP) == 11)
get_timestamp(multi_filtered[['AUS_CAN_ST2_MIX']])[1:10]
```

### `sfn_mutate`
  
```{r sfn_mutate_multi}
multi_mutated <- sfn_mutate(multi_sfn, ws = ws * 3600/1000)
get_env_data(multi_mutated[['AUS_CAN_ST2_MIX']])[['ws']][1:10]
```

### `sfn_mutate_at`

```{r sfn_mutate_at_multi}
vars_to_not_mutate <- c(
  "TIMESTAMP", "ta", "rh", "vpd", "sw_in", "ws",
  "precip", "swc_shallow", "ppfd_in", "ext_rad"
)

multi_mutated_2 <- sfn_mutate_at(
  multi_sfn,
  vars(-one_of(vars_to_not_mutate)),
  list(~ case_when(
    ws > 25 ~ NA_real_,
    TRUE ~ .
  ))
)

multi_mutated_2[['ARG_TRE']]
```

### `*_metrics`

```{r metrics_multi}
multi_metrics <- daily_metrics(multi_sfn)
multi_metrics[['ARG_TRE']][['sapf']]
```
