---
title: "Uncertainty estimation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Uncertainty estimation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette shows how to estimate the uncertainty of the efflux and production estimates using `bootstrap_error()`.
We will first generate a dataset of 'measurement uncertainty' in the input parameters and then use `bootstrap_error()` to estimate the resulting uncertainty in the models.

```{r setup}
library(ConFluxPro)
```

# Uncertainty of input variables

## General workflow and uncertainty from `gasdata`

The most relevant source of uncertainty in FGM-based models is that of the input parameters.
Especially parameters related to the estimation of the calculation of the diffusion coefficient `DS` are hard to measure and soil heterogeneity introduces more uncertainty.
We'll use the example dataset to demonstrate this effect. 
```{r load base_dat}
data("base_dat", package = "ConFluxPro")
mod_pf <- pro_flux(base_dat)
```

In this dataset, the molar fraction of CO~2~ is measured in three replicates at each depth within the soil. 
In a normal model, we use all replicates to fit a single profile. 
By randomly sampling from these replicates, we can estimate the uncertainty the concentration profiles measurement introduces into the model. 
This can be done in `bootstrap_error()`
```{r bootstrap-gasdata}
set.seed(42) # to get exactly same result every time 
mod_pf_bs_gasdata <- 
  bootstrap_error(mod_pf, 
                  n_samples = 25, 
                  sample_from = "gasdata")
```
With `n_samples = 25` we created 25 new models.
In each of these, a random selection of all measured `x_ppm` values was created per profile. 
This is done by resampling the same number of observations at each depth while allowing replacing (meaning that a single measurement may be sampled multiple times!). 
If we extract the efflux, we now get a second column `DELTA_efflux` that gives the uncertainty.
Notice, that the value of `efflux` has changed slightly from the original model.
This is because it is now estimated as the mean value of all 25 models, while the standard deviation is the estimate of `DELTA_efflux`.

```{r}
## after bootstrapping
mod_pf_bs_gasdata %>%
  filter(prof_id == 1) %>%
  efflux()

## originial model
mod_pf %>%
  filter(prof_id == 1) %>%
  efflux()
```

Similarly, we can extract the `production()`, where respective columns have also been added.

```{r}
mod_pf_bs_gasdata %>%
  filter(prof_id == 1) %>%
  production()
```

## Uncertainty from `soilphys`

However, not only the gas concentration data carries uncertainty.
Indeed, many parameters contained within the `soilphys` dataset are subject to significant uncertainty due to sampling error and soil heterogeneity.
In the present dataset this is not represented at all.
So, let's pretend that we measured `TPS` in three replicates identified by `replicate_id` that have some spread around the mean.
We will do this by randomly sampling from a normal distribution with a standard deviation of 0.01.
```{r}
library(dplyr)
soilphys <- cfp_soilphys(base_dat)

set.seed(42)
soilphys_replicate_TPS <-
soilphys %>%
  # get base TPS info
  select(site, upper, lower, TPS) %>%
  distinct() %>%
  rename(TPS_mean = TPS) %>%
  # repeat each row 3 times 
  cross_join(data.frame(replicate_id = 1:3)) %>%
  # generate new, random TPS values
  mutate(TPS = rnorm(n(), TPS_mean, 0.01)) %>%
  # join with rest of dataset
  left_join(soilphys %>% select(!TPS),
            by = c("site", "upper", "lower"),
            relationship = "many-to-many") %>%
  # recaluclate DS and c_air
  complete_soilphys(DSD0_formula = "a*AFPS^b", quiet = TRUE) %>%
  cfp_soilphys(id_cols = c("site", "Date", "replicate_id"))

```
Now we can create a new dataset that includes these new values of TPS and create a `cfp_pfmod` model.
Of course you could use real measurements instead and include the variability of other parameters (`SWC`, `t`, ...) as well.
We don't call `pro_flux()`, because we don't want to calulcate the result from each replication individual.
```{r}
replicate_dat <- 
  cfp_dat(cfp_gasdata(base_dat),
          soilphys_replicate_TPS,
          cfp_layers_map(base_dat))

mod_pf_replicate <-
  cfp_pfmod(replicate_dat)

```
Now we can run `bootstrap_error()` again.
This time, we will tell the function to resample from the `soilphys` dataset when creating the bootstrapping runs. 
To do this, we need to tell it which `id_cols` define(s) the replications of the dataset.
In our case, this is `replicate_id`.
Here, only a single value for each profile and depth is sampled for each run.
The new call looks like this:
```{r}
set.seed(42)
mod_pf_bs_soilphys <- 
  bootstrap_error(mod_pf_replicate, 
                  n_samples = 25, 
                  sample_from = "soilphys",
                  rep_cols = "replicate_id")
```
If we take a look into the result of `efflux()` again, we can see that the uncertainty of the efflux estimate has increased compared to using the variability in `gasdata`.
This is both because `TPS`, from which `DS` is derived, has a strong impact on the flux calculation and because the variability within the `gasdata` dataset is probably unrealistically low and more representative of an instrument than of a measurement error.

```{r}
## boostrapping from soilphys
mod_pf_bs_soilphys %>%
  filter(prof_id < 10) %>%
  efflux()

## boostrapping from gasdata
mod_pf_bs_gasdata %>%
  filter(prof_id < 10) %>%
  efflux()
```

Finally, we can also use the variability from both datasets at the same time.
This runs both sampling strategies independent from another.

```{r}
set.seed(42)
mod_pf_bs_both <- 
  bootstrap_error(mod_pf_replicate, 
                  n_samples = 25, 
                  sample_from = "both",
                  rep_cols = "replicate_id")
mod_pf_bs_both %>%
  filter(prof_id < 10) %>%
  efflux()
```