---
title: "dynamicSDM: Response variable data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{dynamicSDM: Response variable data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

*dynamicSDM* offers novel tools for developing species geographical distribution and abundance
modelling (SDM) at high spatiotemporal resolution. Across the dynamicSDM vignettes, we demonstrate
package functions on a case study species and guide users through the four key species distribution
modelling stages:

+ Stage 1: Organising response variable data including species occurrence and pseudo-absence records
+ Stage 2: Extracting spatio-temporally buffered explanatory variables 
+ Stage 3: Fitting SDMs whilst accounting for spatial and temporal autocorrelation 
+ Stage 4: Projecting intra- and inter-annual geographical distributions and abundances at high
spatiotemporal resolution.


## Installation

Before starting this tutorial, the *dynamicSDM* package needs to be installed. You can install the
latest version from [GitHub](https://github.com/r-a-dobson/dynamicSDM) with:

```{r install package GitHub}
# devtools::install_github('r-a-dobson/dynamicSDM')
```

or from [CRAN](https://CRAN.R-project.org/package=dynamicSDM) with

```{r install package CRAN}
#install.packages("dynamicSDM")
library(dynamicSDM)
library(terra)
```

Due to the utilisation of Google Earth Engine and Google Drive across *dynamicSDM* functions, you must
also set up a free Google account at [Google]( https://www.google.com/account/about/). Once you have
your Google log-in details, these can be used to authorise Google Earth Engine in the `rgee` package
and Google Drive in the `googledrive` package.

```{r check Google, eval=FALSE }
library(rgee)
#rgee::ee_install()
rgee::ee_check()

library(googledrive)
#googledrive::drive_auth_configure()
googledrive::drive_user()
```


## Directory organisation

As we continue through the species distribution modelling framework, functions will generate a range of outputs. To keep organised, we recommend a structured directory system. The code below will create a new project directory in your temporary folders, but you can edit the script to specify your home directory if you want to retain function outputs between R sessions. 

```{r create directory}
project_directory <- file.path(file.path(tempdir(), "dynamicSDM_vignette"))
dir.create(project_directory)
```


## Case study: the red-billed quelea 

In this four stage tutorial, we will be studying the case study species, the red-billed quelea
(Quelea quelea), a nomadic bird inhabiting sub-Saharan Africa. With highly variable distributions
driven by short-term weather and resource availability, *dynamicSDM* provides key functions to
accurately model quelea’s distribution through space and time.

## Stage 1: Response variable data

In this tutorial, we will be processing a sample of species occurrence records for the red-billed
quelea (Quelea quelea) to generate an SDM response variable data frame. This will involve filtering
records by spatio-temporal quality, extent and resolution; testing and accounting for spatio-temporal
biases in records, and generating pseudo-absences through space and time.

### a) Import occurrence records

A sample of red-billed quelea occurrence data can be loaded into your local R environment using the
code below.

```{r import occurrence}
data("sample_occ_data")
```

These records are formatted as exported from the Global Biodiversity Information Facility, with
columns labelled “decimalLongitude” and “decimalLatitude”. For *dynamicSDM* functions, occurrence data
must be in a specific format; containing numeric data in columns labelled “x” and “y” for the record
co-ordinates longitude and latitude, and “day”, “month” and “year” for the record’s date.

`convert_gbif()` can be used to convert between these formats, or this can be achieved manually for
data not extracted from GBIF.

```{r example-convert_gbif}
sample_occ <- convert_gbif(sample_occ_data)
```


### b) Filter occurrence records

#### Validity and completeness

Species occurrence records may contain anomalous or missing values in the co-ordinates or dates
given. `spatiotemp_check()` can be used to exclude records with missing (NA) values (argument
`na.handle`), invalid co-ordinates (`coord.handle`), invalid dates (`date.handle`) and duplicate
records (`duplicate.handle`).

Here, we specify `date.res = "day"` to check all date columns from year to day. If you only want to
check the year of record dates, then you could specify `date.res = "year"`.

```{r example-spatiotemp_check}
sample_occ_filtered <- spatiotemp_check(occ.data = sample_occ,
                                               na.handle = "exclude",
                                               date.handle = "exclude",
                                               date.res = "day",
                                               coord.handle = "exclude",
                                               duplicate.handle = "exclude")
nrow(sample_occ_filtered)
```

Additionally, users can opt to take advantage of the `CoordinateCleaner` package functions to exclude
records flagged as containing potentially anomalous co-ordinates, based on a wider variety of
spatial tests than *dynamicSDM* alone offers.

```{r example-spatiotemp_check with Coordinate Cleaner,eval=F}
sample_occ_filtered <- spatiotemp_check(occ.data = sample_occ,
                                        na.handle = "exclude",
                                        date.handle = "exclude",
                                        date.res = "day",
                                        coord.handle = "exclude",
                                        duplicate.handle = "exclude",
                                        coordclean = TRUE,
                                        coordclean.species = "quelea",
                                        coordclean.handle = "exclude")
```


#### Spatiotemporal extent

When generating dynamic species distribution models, it is important that occurrence records match
the spatio-temporal extent of the study and any remote-sensing datasets to be utilised as explanatory
variables. `spatiotemporal_extent()` removes any records outside of the set extents.

In this case study, we want to model the subspecies Q. q. lathamii which is typically limited to
southern Africa and we will incorporate remote sensing datasets available for the years 2001 to
2019. To filter records to these extents, we use a polygon of southern African countries, which can
be read into your R environment and applied using the following code.

```{r example-spatiotemp_extent,fig.width=4, fig.height=4}
data("sample_extent_data")

sample_occ_cropped <- spatiotemp_extent(occ.data = sample_occ_filtered,
                                               temporal.ext = c("2001-01-01", "2018-12-01"),
                                               spatial.ext = terra::ext(sample_extent_data),
                                               prj = "+proj=longlat +datum=WGS84 +no_defs")


## Lets plot the change
#terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_filtered[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"))

#terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"))
```


#### Spatiotemporal resolution

The spatial resolution of occurrence record co-ordinates and the temporal resolution of occurrence
record dates are also important when generating dynamic SDMs. `spatiotemp_resolution()` can filter
records by set thresholds.

As quelea migrations are driven by highly local and short-term environmental factors, we will filter
to exclude records not given to the specific day and co-ordinates to four decimal places.

```{r example-spatiotemp_resolution}
sample_occ_cropped<-spatiotemp_resolution(occ.data = sample_occ_cropped,
                                          spatial.res = 4,
                                          temporal.res = "day")

nrow(sample_occ_cropped) # Even more records have been removed!
```


### c) Test for spatial and temporal sampling biases

For many reasons, the collection of species occurrence records are biased through space and time.
These biases can impact SDM performance by over- or under- representing conditions at a given site
or time. `spatiotemp_bias()` returns the results of statistical tests and plots for visual
assessment of such biases. 
The code below tests for bias across the entire dataset and species range.

```{r example-spatiotemp_bias simple}
bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped,
                                temporal.level = c("year"),
                                plot = FALSE,
                                spatial.method = "simple",
                                prj = "+proj=longlat +datum=WGS84")
bias_results
```

Whereas, this code will test for bias across only the core of the species range. 

```{r example-spatiotemp_bias core }
bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped,
                                temporal.level = c("year"),
                                plot = FALSE,
                                spatial.method = "core",
                                prj = "+proj=longlat +datum=WGS84")
bias_results
```

Both approaches can generate important results. The latter may be a more reliable indicator if the
species range is expanding or shifting in response to global change. This is because records at the
range periphery that have been collected randomly may still not be evenly distributed through space
and time due to underlying ecological processes. Therefore, testing for biases on the species range
core may be more representative of true spatial and temporal sampling biases.

### d) Account for spatial and temporal sampling biases

#### Occurrence record thinning

One approach to correct spatio-temporal sampling biases are to “thin” records, which involves
excluding records that have been collected too closely in space or time. `spatiotemp_thin()` allows
users to specify the spatial and temporal distance to thin records by. An important argument of
`spatiotemp_thin()` is `spatial.split.degrees` which specifies the size of grid cells to split
occurrence records into before temporal thinning. This prevents spatially distant but temporally
close records from being filtered out.

```{r example-spatiotemp_thin,fig.width=4, fig.height=4}
occ_thin <- spatiotemp_thin(occ.data = sample_occ_cropped,
                                   temporal.method = "day",
                                   temporal.dist = 30,
                                   spatial.split.degrees = 3,
                                   spatial.dist = 100000,
                                   iterations = 5)

# Plot the difference in spatial distribution after thinning
# Non-thinned
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T)

# Thinned
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(occ_thin[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T)
```


#### Weight records by sampling effort

Another approach to account for spatio-temporal sampling biases is to weight records by sampling
effort when fitting SDMs.This results in the downweighting of the importance of records from
over-sampled regions and vice versa. 

Here, we will read in a sample of e-Bird avian sampling records
in southern Africa to use as a proxy for sampling effort, and sum these across a spatial and temporal
buffer from each occurrence record using `spatiotemp_weights()`.

```{r example-spatiotemp_weights}
data("sample_events_data")
sample_occ_cropped <- spatiotemp_weights(occ.data = sample_occ_cropped,
                                       samp.events = sample_events_data,
                                       spatial.dist = 100000,
                                       temporal.dist = 30,
                                       prj = "+proj=longlat +datum=WGS84")
```


### e) Generate pseudo-absences through space and time

Often the paucity of species absence records necessitates the generation of pseudo-absences, which
are inferred absences based upon presence records. When modelling a species distribution at high
spatiotemporal resolution, it may be best to generate pseudo-absences within close spatial and
temporal buffers of species occurrence. This is because model presence-absence comparisons are made
at fine scales, instead of coarsely between biomes and seasons. Using `spatiotemp_pseudoabs()` you
can specify the spatial and temporal buffer size or extents to randomly generate pseudo-absences
within.

```{r example-spatiotemp_pseudoabs,fig.width=4, fig.height=4}
# Pseudo-absences generated within spatial and temporal buffer
pseudo_abs_buff <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped,
                                        spatial.method = "buffer",
                                        temporal.method = "buffer",
                                        spatial.ext = sample_extent_data,
                                        spatial.buffer = c(250000, 500000),
                                        temporal.buffer = c(42, 84),
                                        n.pseudoabs = nrow(sample_occ_cropped))

# Pseudo-absences generated randomly across given spatial and temporal extent
pseudo_abs_rand <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped,
                                        spatial.method = "random",
                                        temporal.method = "random",  
                                        spatial.ext = sample_extent_data,
                                        temporal.ext = c("2002-01-01", "2019-12-01"),
                                        n.pseudoabs = nrow(sample_occ_cropped))

# Plot the spatial distribution of pseudo-absences (red) compared to occurrence records for:
# Buffered
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T)
terra::plot(terra::vect(pseudo_abs_buff[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T)

# Random
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T)
terra::plot(terra::vect(pseudo_abs_rand[, c("x", "y")],
                        geom = c("x", "y"),
                        crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T)
```

Then, the generated pseudo-absences need to be added to the occurrence record data frame by adding
relevant columns and binding the rows together.

```{r complete_dataset,eval=F}
pseudo_abs_buff$decimalLatitude <- pseudo_abs_buff$y
pseudo_abs_buff$decimalLongitude <- pseudo_abs_buff$x
pseudo_abs_buff$occurrenceStatus <- rep("absent", nrow(pseudo_abs_buff))
pseudo_abs_buff$source <- rep("dynamicSDM", nrow(pseudo_abs_buff))
pseudo_abs_buff<-spatiotemp_weights(occ.data = pseudo_abs_buff,
                                       samp.events = sample_events_data,
                                       spatial.dist = 100000,
                                       temporal.dist = 30,
                                       prj = "+proj=longlat +datum=WGS84")

complete.dataset <- as.data.frame(rbind(sample_occ_cropped, pseudo_abs_buff))
```

## Summary
At the end of this vignette, we now have a complete data frame of filtered species occurrence and
pseudo-absence records of appropriate spatio-temporal quality, extent and resolution. Let’s save this
to our project directory for use in the next tutorial!

```{r export_dataset,eval=F}
write.csv(complete.dataset, file = paste0(project_directory, "/filtered_quelea_occ.csv"))
```





