---
title: "Memory and Parallelization"
author: "Victor Granda (Sapfluxnet Team)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Memory and Parallelization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Memory

In order to be able to work with the whole database at the *sapwood* or *plant*
level it is recommended at least $16GB$ of RAM memory. This is because loading
all data objects already consumes $4GB$ and any operation like aggregation or
metric calculation results in extra memory needed:

```{r memory_all, eval=FALSE}
library(sapfluxnetr)

# This will need at least 5GB of memory during the process
folder <- 'RData/plant'
sfn_metadata <- read_sfn_metadata(folder)

daily_results <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Woodland/Shrubland'),
    sites = sites, metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

# Important to save, this way you will have access to the object in the future
save(daily_results, file = 'daily_results.RData')
```

To circumvent this in less powerful systems, we recommend to work in small
subsets of sites (25-30) and join the tidy results afterwards:

```{r memory_steps, eval = FALSE}
library(sapfluxnetr)

folder <- 'RData/plant'
metadata <- read_sfn_metadata(folder)
sites <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Woodland/Shrubland'),
    sites = sites, metadata = sfn_metadata
  )

daily_results_1 <- read_sfn_data(sites[1:30], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)
daily_results_2 <- read_sfn_data(sites[31:60], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)
daily_results_3 <- read_sfn_data(sites[61:90], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)
daily_results_4 <- read_sfn_data(sites[91:110], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

daily_results_steps <- bind_rows(
  daily_results_1, daily_results_2,
  daily_results_3, daily_results_4
)

rm(daily_results_1, daily_results_2, daily_results_3, daily_results_4)
save(daily_results_steps, file = 'daily_results_steps.RData')
```

## Parallelization

`sapfluxnetr` includes the capability to parallelize the metrics calculation
when performed on a `sfn_data_multi` object. This is made thenks to the
[furrr](https://github.com/futureverse/furrr) package, which uses the
[future](https://github.com/futureverse/future) package behind the scenes.
By default, the code will run in a sequential process, which is the usual way
the R code runs. But setting the `future::plan` to `multicore` (in Linux),
`multisession` (in Windows) or `multiprocess` (automatically choose between the
previous plans depending on the system) will run the code in parallel, dividing
the sites between the available cores.

    > Be advised, parallelization usually means more RAM used, so in systems
      with less then 16GB maybe is not a good idea.
      Also, the time benefits start to show when analysing 10 sites or more.

```{r parallelizations, eval = FALSE}
# loading future package
library(future)

# setting the plan
plan('multiprocess')

# metrics!!
daily_results_parallel <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Woodland/Shrubland'),
    sites = sites, metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

# Important to save, this way you will have access to the object in the future
save(daily_results_parallel, file = 'daily_results_parallel.RData')
```


### Memory limit

When using `furrr`, even in the `sequential` plan, the `future` package sets
a limit of $500MB$ for each core. With sapfluxnet data this limit is easily
exceeded, causing an error. To avoid this we may want to set the
`future.globals.maxSize` limit to a higher value ($1GB$ for example, but the
limit wanted really depend on the plan and the number of sites):

```{r max_limit, eval = FALSE}
# future library
library(future)

# plan sequential, not really needed, as it is the default, but for the sake of
# clarity
plant('sequential')

# up the limit to 1GB, this in bytes is 1014*1024^2 
options('future.globals.maxSize' = 1014*1024^2)

# do the metrics
daily_results_limit <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Woodland/Shrubland'),
    sites = sites, metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

# Important to save, this way you will have access to the object in the future
save(daily_results_limit, file = 'daily_results_limit.RData')
```


