---
title: "Get started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Get started}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This is a brief introduction to AMBI index calculations with ambiR.

## Structuring species observation data

Species counts (or abundance) should be organized in a data frame
arranged in *long* format. That is, species names are in one column and
species counts in another column. If the data represents several
stations and/or if there are replicates, then the data should include
columns with this information.

Looking at the `test_data` example dataset included with the ambiR package, 
one can see how the data should be arranged:

```{r test_data}
library(ambiR)

head(test_data)
```

If your data look like this, you can go directly to [Calculate AMBI]. If
not, the following examples show how to reorganize your data.

### Species names in columns {#data-with-species-names-in-columns}

Here is an example dataframe where there are counts for species in
separate columns:

```{r transpose_wide, include=FALSE}
library(dplyr)
library(tidyr)

wide_data_species <- test_data %>%
  pivot_wider(names_from = "species", 
              values_from = "count",
              values_fill = 0)
```

```{r data_wide_spec}
head(wide_data_species)
```

To arrange the data in the correct form, use `tidyr::pivot_longer()`:  

```{r pivot_species}
# columns 1 and 2 contain station and replicate information
# so, select all columns from 3 to be pivoted 

long_data <- wide_data_species %>%
  pivot_longer(cols = 3:ncol(wide_data_species), 
               names_to = "species",
               values_to = "count")

head(long_data)
```

### Stations and replicates in columns

Here is an example where each column contains species counts for a
station and replicate. The first row of the table contains the station
ID *1, 2, ...* and the second row contains the replicate ID *a, b, ...*.
The first column of the table contains species names.

```{r transpose_wide_stns, include=FALSE}
df <- test_data %>%
  pivot_wider(names_from = c("station","replicate"), 
              values_from = "count", values_fill = 0)

stns <- names(df)[2:ncol(df)]

stns <- stns %>% 
  sapply(strsplit,"_") 
reps <- stns  %>%  
  sapply(function(x){x[2]})
stns <- stns  %>%  
  sapply(function(x){x[1]})
n <- 1+length(stns)
df2 <- matrix(nrow=2, ncol=n) %>% 
  as.data.frame()

df2[1,2:n] <- stns
df2[2,2:n] <- reps
names(df) <- names(df2)
df <- df %>%
  mutate(across(all_of(names(df)), as.character))

df <- df2 %>%
  bind_rows(df)

names(df) <- rep("", ncol(df))

wide_data_stns <- df
```

```{r data_wide_stns}
head(wide_data_stns)
```

Note that if the if the observation data *only* has stations *or*
replicates, then rearranging the data can be done in one step, as in the
[previous example](#data-with-species-names-in-columns). But in this
case, the station and replicate information are in separate rows. The
restructuring process is a little more complicated.

#### Combine station and replicate values

Before we can use a pivot function, we need to combine the station ID
and replicate ID for each column into a single value. In this example,
each station ID and replicate ID are joined into a single character
value, with an underscore to separate them *`_`*. The underscore will be
used again after pivoting the table to identify where to split the
combined station/replicate values back into separate values again.

If there are station names which already contain underscores, then
another suitable character should be used when joining and splitting
station/replicate IDs.

```{r combine_stn_rep}
sep_character <- "_"

# get the station IDs from row 1, excluding column 1 (this contains species names)
stations <- wide_data_stns[1, 2:ncol(wide_data_stns)]

# get the replicate IDs from row 2
replicates <- wide_data_stns[2, 2:ncol(wide_data_stns)]

# combine the station and replicate for each column using an underscore
station_replicate <- paste(stations, replicates, sep=sep_character)

# now we have extracted the station/replicate information, we can drop the 
# first two rows of the data frame
wide_data_stns <- wide_data_stns[3:nrow(wide_data_stns),]

# apply the station_replicates as column names
names(wide_data_stns) <- c("species", station_replicate)

head(wide_data_stns)
```

#### Transpose using the combined station/replicate variable

We can see that the column names now contain the combined station and
replicate information. We are ready to transpose the data.


```{r pivot_stations}
# column 1 contains species names
# so, select all columns from 2 to be pivoted 

long_data <- wide_data_stns %>%
  pivot_longer(cols = 2:ncol(wide_data_stns), 
               names_to = "stn_rep",
               values_to = "count")

head(long_data)
```

#### Retrieve station and replicate variables from the transposed data

Now we can split the *stn_rep* column into separate columns for
*station* and *replicate*, We will use the underscore we introduced
earlier to identify where the split should occur:

```{r split_stations}
long_data <- long_data %>%
  separate_wider_delim(cols="stn_rep", 
                       delim = sep_character,
                       names=c("station", "replicate"))

head(long_data)
```

Now we are ready to calculate AMBI...

## Calculate AMBI

We have now ensured that our species abundance/count data have the
correct structure, as in the example `test_data` provided:

```{r test_data2}
head(test_data)
```

Call the `AMBI()` function:

```{r run_ambi}
res <- AMBI(test_data, by="station", var_rep="replicate", 
            var_species="species", var_count="count")
```

### Results

The `AMBI()` function returns a list of three dataframes:

-   `.$AMBI`
-   `.$AMBI_rep`
-   `.$matched`

`.$AMBI`- the main results with the `AMBI` index calculated for each
unique combination of `by` variables, in our case the results are per
`station`.

In addition to the `AMBI` index, the results also include the *Shannon diversity
index* `H'` and the *Species richness* `S`, the three
metrics which are necessary to calculate [M-AMBI](#calculate-m-ambi).

```{r show_ambi}
res$AMBI
```

`.$AMBI_rep` - if the observations include replicates, then the
function also returns results calculated for each replicate, within each
unique combination of `by` variables:

```{r show_ambi_rep}
res$AMBI_rep
```

`.$matched` - for each observation, this dataframe shows which species
in the `AMBI` list the observed species was matched with, if any. It
also shows the `AMBI` species group assigned. This dataframe has the
same number of rows as the input data.

```{r show_matched}
head(res$matched)
```

For more information about the principles underlying the `AMBI` calculations and
the grouping of species according to sensitivity to pollution,
see `vignette("background")`.

## Calculate M-AMBI {#calculate-m-ambi}

Calculate M-AMBI the multivariate AMBI index, based on the three
separate species diversity metrics:

-   AMBI index `AMBI`.
-   Shannon diversity index `H`
-   Species richness `S`.

All three indices required to calculate `MAMBI()`are included in the
results returned by the `AMBI()` function.

### Limit values

In addition to index values calculated from observed species data, the
M-AMBI factor analysis requires values defining the limits for the three
metrics, corresponding to the best and worst possible conditions.

See @MUXIKA200716 for more details.

The default limit values used by `MAMBI()` are:

```{r limits_mambi}
limits_AMBI <- c("bad" = 6, "high" = 0)

limits_H  <- c("bad" = 0, "high" = NA)

limits_S  <- c("bad" = 0, "high" = NA)
```

By default, upper limit values for `H` and `S` are not provided (`= NA`).
If they are not provided explicitly, then the maximum values found in
the input data will be used. The results returned by `MAMBI()` show the
limits used.

### Status class boundaries

`MAMBI()` also returns the status class (*Bad, Poor, Moderate, Good* or
*High*) for each M-AMBI index value. To do this, it compares the
calculated M-AMBI index value with values defining the boundaries
between status classes.

-   `PB` - *Poor* / *Bad*
-   `MP` - *Moderate* / *Poor*
-   `GM` - *Good* / *Moderate*
-   `HG` - *High* / *Good*

The default values for the class boundaries are:

```{r bounds_mambi}
bounds <-c("PB" = 0.2, "MP" = 0.39, "GM" = 0.53, "HG" = 0.77)
```

### Call `MAMBI()`

We call `MAMBI()` using the previously generated `AMBI` results:

```{r calc_mambi}
res_mambi <- MAMBI(res$AMBI, var_H = "H", var_S = "S", var_AMBI = "AMBI")
```

### M-AMBI results

In addition to retaining the values for `AMBI`. `H` and `S`, the
results include the following information:

-   `x`, `y`, `z` - factor scores giving coordinates in the new factor
    space.
-   `MAMBI` - the calculated M-AMBI index value
-   `Status` - the status class for the M-AMBI index value
-   `EQR` - the normalised [EQR](#eqr-values) index

The dataframe returned contains 2 more rows than the input data. Our
input data contained 3 rows (1 for each `station`). The results contain
5 rows. The 2 *additional* rows show the limit values for each of the
three metrics used in the M-AMBI calculations.

```{r mambi_results}
res_mambi %>%
  select(station, AMBI, H, S, x, y, z, MAMBI, Status, EQR)
```

#### EQR values {#eqr-values}

As well as the status class, the function also returns a normalised
index value from 0 to 1 (*EQR*), calculated using the `bounds` boundary
values for the M-AMBI index. The following EQR values correspond to
status class boundaries:

-   `0.2` - *Poor* / *Bad*
-   `0.4` - *Moderate* / *Poor*
-   `0.6` - *Good* / *Moderate*
-   `0.8` \_ *High* / *Good*

For example, the M-AMBI value for station 3 is `0.478`. This lies
between the M-AMBI value corresponding to the *Moderate/Poor* status
class boundary (`"MP" = 0.39`) and the M-AMBI value corresponding to the
*Good/Moderate* status class boundary (`"GM" = 0.53`).

The normalised EQR value is given by linear interpolation: 
$$ EQR = 0.4 + (0.6 - 0.4) \frac{0.478 - 0.39}{0.53 - 0.39 } $$

## References