---
title: "2 - Running Segmentation/Clustering with segclust2d"
author: "R. Patin"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2 - Running Segmentation/Clustering with segclust2d}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r option chunk, echo = FALSE}
options(Encoding="UTF-8")
knitr::opts_chunk$set(
  fig.width = 8,
  fig.height = 5,
  collapse = TRUE,
  comment = "#>"
)
```



```{r load segclust2d}
library(segclust2d)
```

The package `segclust2d` provides access to two algorithms:

- It can perform a [segmentation](#segmentation) of the time-series into
homogeneous segments. A typical use case is the identification of home-range shifts.

- It can also perform an integrated classification of those segments into
clusters of homogeneous behaviour through a
[segmentation/clustering](#segmentation-clustering) algorithm. 
This can be used to identify behavioural modes.

Input data used in the examples here is a `data.frame` object, but it can also
be a `Move` object, an `ltraj` object (from package `adehabitatLT`) or an `sftraj`
object, as shown in section
[Preparing data for Segmentation/Clustering with segclust2d](https://rpatin.github.io/segclust2d/articles/v01_preparing_data.html).

# Presentation of data sets

## simulshift: three successive home-ranges

```{r load simulshift, fig.show='hold'}
data(simulshift)
```

Here we load a test dataset, `simulshift`, containing a simulation of
home-range behaviour with two shifts between three successive home-ranges. It
is a data.frame with two columns for the coordinates: x and y. We can now run a
simple segmentation with this dataset to find the different home-ranges.
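
Before segmenting, it can be useful to glance at the raw data. Below is a
minimal inspection of the data.frame and its `x`/`y` columns (a sketch, not
evaluated here):

```{r inspect simulshift, eval = FALSE}
# Quick look at the input data.frame and its coordinate columns
str(simulshift)
head(simulshift[, c("x", "y")])
```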

```{r plot simulshift, echo = FALSE, fig.width = 5,  fig.height = 4, fig.cap = "`simulshift`: simulation of movement within three successive home-ranges. Data shown after subsampling by 100."}
library(ggplot2)
tmpdf <- simulshift[seq(1,30000, by = 100),]
tmpdf$class <- factor(rep(c(1,2,3), each = 102)[1:300])
ggplot(tmpdf)+
  geom_path(aes(x = x, y = y))+
  geom_point(aes(x= x, y = y, col = factor(class)))+
  scale_color_discrete("home-range")+
  theme(legend.position= "top")
```

## simulmode: a simulation with three movement modes

```{r load simulmode, fig.show='hold'}
data(simulmode)
simulmode$abs_spatial_angle <- abs(simulmode$spatial_angle)
simulmode <- simulmode[!is.na(simulmode$abs_spatial_angle), ]
```

`simulmode` is an example dataset containing a movement simulation with three
movement modes. It is a data.frame with 11 columns, with location coordinates
and several covariates. In addition to loading the data, we compute the
absolute value of the turning angle at constant step length (here called
`spatial_angle`) and remove observations with missing values.
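
A quick sanity check (sketch, not evaluated here) that no missing values remain
in the variables used later for segmentation:

```{r check simulmode NAs, eval = FALSE}
# After the filtering above, these counts should both be 0
sum(is.na(simulmode$abs_spatial_angle))
sum(is.na(simulmode$speed))
```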

```{r map simulmode, echo = FALSE, fig.width = 5,  fig.height = 4, fig.cap = "`simulmode`: simulation of movement with three different behavioural modes."}

library(ggplot2)
tmpdf <- simulmode
tmpdf$class <- factor(rep(c(1,2,3), each = 20, 5))
ggplot(tmpdf)+
  geom_path(aes(x = x, y = y))+
  geom_point(aes(x= x, y = y, col = factor(class)))+
  scale_color_discrete("behavioural mode")+
  theme(legend.position= "top")
```

# Segmentation

## Setting `segmentation()` parameters.

### Required: `lmin`: minimum length of a segment

To run the `segmentation()` function, the argument `lmin` must be provided.
`lmin` is the minimum length of a segment. It has to be set not only to speed
up the algorithm, but also, more fundamentally, to prevent over-segmenting,
based on biological considerations. For example, setting `lmin` to a value of a
few weeks when analysing locational time series will prevent the algorithm from
considering an area exploited only for a few days, corresponding to a foray
outside the usual home range or to a stopover during migration, as a distinct
home range. Similarly, setting `lmin` to a value long enough (depending on the
species) when looking for changes of behavioural modes will force the algorithm
to assign a given behavioural bout to a given mode even when it is interspersed
with ephemeral events related to another behaviour (e.g. a long transit with
opportunistic short feeding events on the move will be considered as a single
transit phase).

Statistically, `lmin` cannot be smaller than `5`, because of the need to
estimate variances for each segment. To avoid such a case, the function will
force `lmin` to a minimum of `5`.

### Optional: `Kmax`: the maximum number of segments.

By default, `Kmax` is set to `0.75*floor(n/lmin)`, with `n` the number of
observations, so you can omit this argument. You can however provide a
different value, since a large value may slow down the algorithm or compromise
the selection of the optimal number of segments.
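
As a rough check before calling the function, the default value can be computed
by hand. The sketch below simply mirrors the formula given above; it is not the
package's internal code, and note that `subsample_by` (described in the data
preparation vignette) reduces the effective number of observations:

```{r default Kmax sketch, eval = FALSE}
# Default Kmax described above, computed for simulshift (before subsampling)
lmin <- 240
n <- nrow(simulshift)
0.75 * floor(n / lmin)
```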

### Optional: `seg.var`
 
You can specify the variables to be segmented using the argument `seg.var`. By
default, `segmentation()` will use the variables `x` and `y` for a data.frame,
or the coordinates for `Move` and `ltraj` objects.
 
### Optional: `scale.variable`

The function allows rescaling of variables with the argument `scale.variable`.
Rescaling sets each variable in `seg.var` to a mean of 0 and a variance of 1. It
is recommended if the variables chosen for segmentation are of a different
nature (e.g. speed and turning angle). It is therefore inadvisable for
segmentation of coordinates to identify home-ranges. By default
`segmentation()` disables rescaling and sets `scale.variable = FALSE`.
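
To make explicit what the rescaling does, the sketch below reproduces it with
base R's `scale()` on one variable of `simulmode`. This is an illustration
only, not the package's internal code:

```{r manual scaling sketch, eval = FALSE}
# Rescaling to a mean of 0 and a variance of 1, as scale.variable = TRUE
# does for each variable in seg.var
scaled_speed <- as.numeric(scale(simulmode$speed))
mean(scaled_speed, na.rm = TRUE) # approximately 0
var(scaled_speed, na.rm = TRUE)  # approximately 1
```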

### Optional: `diag.var`

`segmentation()` uses the variables provided in `seg.var` and automatically
produces summary statistics (mean and standard deviation) for each variable and
each segment. If you want summary statistics for variables other than
`seg.var`, you can specify them in `diag.var`. By default `diag.var` is set to
`seg.var`.
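
For example, a call requesting summary statistics on `spatial_angle` in
addition to the two segmented variables might look like this (a sketch, not
evaluated here):

```{r diag.var sketch, eval = FALSE}
# Summary statistics will also be computed for "spatial_angle",
# even though it is not used for the segmentation itself
mode_seg_diag <- segmentation(simulmode,
                              seg.var = c("speed", "abs_spatial_angle"),
                              diag.var = c("speed", "abs_spatial_angle",
                                           "spatial_angle"),
                              lmin = 10, Kmax = 20,
                              scale.variable = TRUE)
```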

### Optional: `order.var`

`order.var` is the variable used for ordering segments in some of the output
functions (see section [Exploring Outputs from segclust2d](https://rpatin.github.io/segclust2d/articles/v03_explore_outputs.html)
for details). By default `order.var` is set to `seg.var[1]`.

### Deprecated: `type`, `coord.names`

Older versions of `segclust2d` had different arguments, which are now deprecated. `type` was either `"home-range"` or `"behaviour"`, which provided different default arguments:

- `type = "home-range"` was associated with `scale.variable = FALSE` and
performed a segmentation based on the variables provided as `coord.names`.
Default value was `coord.names = c("x","y")`.
- `type = "behaviour"` was associated with `scale.variable = TRUE` and
performed a segmentation based on the variables provided as `seg.var`, with no
default value.

From version 0.3.0 onwards, those arguments were removed, so that the functions
now work only with the arguments `scale.variable` and `seg.var`. Using the
argument `type` makes the function fail and triggers a message explaining the
required changes:

```{r wrong type argument call, eval = FALSE}
df.seg <- segmentation(simulshift, 
                       type = "home-range",
                       lmin = 300, Kmax = 10,
                       subsample_by = 60)
```

```{r wrong type argument message, echo=FALSE, results='asis'}
        cli::cli_alert_danger("Argument {cli::col_red('type')} \\
                     is deprecated and should not be used")
        cli::cli_alert_danger("Argument {cli::col_red('coord.names')} \\
                     is deprecated and should not be used")
        coord.names <- c("x","y")
        cli::cli_alert("Please use instead \\
                    {.field seg.var = {deparse(coord.names)}} and \\
                    {.field scale.variable = FALSE}")
```


## Running `segmentation()`

`segmentation()` uses a dynamic programming algorithm that finds the best
segmentation for a given number of segments. The function runs the dynamic
programming for each number of segments `< Kmax`. As a result, the optimal
segmentation is associated with a likelihood value for each number of segments.
Section [Different running configurations](#different-running-configurations)
shows several examples of running `segmentation()` with different options.
Section [Selecting the number of segments](#selecting-the-number-of-segments)
discusses the selection of the number of segments using either Lavielle's
criterium or graphical exploration.

### Different running configurations

In this section we will try different ways of calling the `segmentation()`
function. The minimal call contains only the data (here the data.frame
`simulshift`) and the argument `lmin`. The function will automatically complete
the missing arguments with default values and report them (not shown here).

*The `subsample_by` argument is described in section [Subsampling](https://rpatin.github.io/segclust2d/articles/v01_preparing_data.html#subsampling)*

```{r minimal segmentation call, eval = FALSE, message = FALSE}
shift_seg <- segmentation(simulshift, 
                          lmin = 240,
                          subsample_by = 60)
```

It is however advised to provide a more reasonable `Kmax` value to decrease computation time.

```{r Kmax segmentation call, eval = TRUE, message = FALSE}
shift_seg <- segmentation(simulshift,
                          lmin = 240, Kmax = 25,
                          subsample_by = 60)
```

There is a check to ensure that `lmin*Kmax < n`, with `n` the number of
observations. If you provide inadequate values, `Kmax` will be adjusted to an
appropriate value if possible and you should get a message like this:

```{r segmentation warning lmin*Kmax, echo=FALSE, results='asis'}
Kmax = 25
      cli::cli_alert_warning(
        "Adjusting Kmax so that lmin*Kmax < nrow(x). Now, \\
      {cli::col_yellow('Kmax = ', Kmax)}")
    
```

If not possible you will get an error and this message:

```{r segmentation error lmin*Kmax, error = TRUE, echo=FALSE, results='asis'}
        cli::cli_alert_danger(
          "lmin*Kmax > nrow(x) and Kmax cannot be adjusted. \\
          Please provide lower values for lmin")
   stop("lmin*Kmax > nrow(x)")
```

By default the function looks for columns `x` and `y`, or for the coordinates
(in the case of a `Move` or an `ltraj` object). Alternatively, users can
provide their own variables, depending on their aim:

```{r providing seg.var, eval = FALSE, message = FALSE}
shift_seg <- segmentation(simulshift,
                          seg.var = c("x","y"),
                          lmin = 240, Kmax = 25,
                          subsample_by = 60)
```


### Selecting the number of segments

Once the segmentation has run successfully, a summary like this will appear:
 
```{r segmentation summary, echo = FALSE, results = 'asis'}
    cli::cli_alert_success("Best segmentation estimated with \\
                      {shift_seg$Kopt.lavielle} segments, \\
                      according to Lavielle's criterium")
    cli::cli_text(cli::col_grey(
    'Other number of segments may be selected 
      by looking for likelihood breaks with plot_likelihood()'))
    cli::cli_text(cli::col_grey(
    'Results of the segmentation may be explored with plot() and segmap()'))

```

By default, the algorithm chooses the number of segments using a criterium
developed by Marc Lavielle, based on the value of the second derivative of the
penalized likelihood. This criterium uses a threshold value of `S = 0.75`, but
a different threshold can be provided to `segmentation()` if needed.

As stated, it is important to check that the number of segments selected
corresponds to a clear break in the log-likelihood and, if that is not the
case, to select a better value. This can be checked with `plot_likelihood()`,
which shows the log-likelihood of the best segmentation against the number of
segments and highlights the one chosen with Lavielle's criterium. The
likelihood should show an increasing curve with a clear breakpoint at the
optimal number of segments. Note that with real data breaks are often less
clear than in this example. An artifactual decrease of the likelihood can
happen for large numbers of segments when `Kmax` is too high (close to
`n/lmin`) and generally corresponds to an over-segmentation (in such a case,
`Kmax` should be decreased).


```{r plot_likelihood example, fig.show = 'hold', fig.width = 4, fig.height = 3}
plot_likelihood(shift_seg)
```

Another example on the `simulmode` dataset shows a situation where the number
of segments automatically selected by Lavielle's criterium is 1, despite a
(relatively) clear break in the likelihood at `nseg = 15`. In such a case it is
of paramount importance to explore the log-likelihood to confirm or select an
appropriate number of segments.

*Note that in this case the problem in the selection of the number of segments would have been avoided by setting `Kmax = 25` (a sketch of such a rerun is shown after the plot below). For Lavielle's criterium to work properly, `Kmax` should be clearly larger than your expected number of segments.*

```{r plot_likelihood mode_seg, fig.show = 'hold', fig.width = 4, fig.height = 3, message = FALSE}

mode_seg <- segmentation(simulmode,
                         lmin = 10, Kmax = 20,
                         seg.var = c("speed", "abs_spatial_angle"),
                         scale.variable = TRUE)

plot_likelihood(mode_seg)
```
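
As suggested in the note above, a rerun with a larger `Kmax` can be sketched as
follows (not evaluated here):

```{r simulmode larger Kmax sketch, eval = FALSE}
# With Kmax = 25, Lavielle's criterium has more room above the expected
# number of segments, which should avoid the selection problem described above
mode_seg25 <- segmentation(simulmode,
                           lmin = 10, Kmax = 25,
                           seg.var = c("speed", "abs_spatial_angle"),
                           scale.variable = TRUE)
plot_likelihood(mode_seg25)
```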


# Segmentation-Clustering

We will now run the joint segmentation/clustering function `segclust()` on the
`simulmode` data to identify the different behavioural modes. As with
`segmentation()`, you can specify the variables to be segmented using the
argument `seg.var`.



## Setting `segclust()` parameters.

`segclust()` shares most of its parameters with `segmentation()`, so you can
refer to section [Setting `segmentation()` parameters.](#setting-segmentation-parameters--1) 
for the parameters not described here. The only additional parameter is the
number of clusters to be tested.

### Required: `ncluster`: the number of clusters

`ncluster` is a required argument for `segclust()`. The user can provide a
vector of values or a single value. The algorithm provides a BIC-based
criterium to select the number of clusters, although it is advised to select it
based on biological knowledge.
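
When the number of behavioural modes is assumed known a priori, a single value
can be passed. Below is a sketch (not evaluated here) with `ncluster = 3`,
matching the simulation:

```{r single ncluster sketch, eval = FALSE}
# A single value of ncluster instead of a vector of candidate values
mode_segclust_3 <- segclust(simulmode,
                            lmin = 10, Kmax = 20,
                            ncluster = 3,
                            seg.var = c("speed", "abs_spatial_angle"))
```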

## Running `segclust()`

### Different running configurations

In this section we will try different ways of calling the `segclust()` function.
The minimal call contains only the data (here the data.frame `simulmode`) as
well as the arguments `lmin`, `ncluster` and `seg.var`. The function will
automatically complete the missing arguments with default values and report
them (not shown here).


```{r simulmode segclust default, fig.show='hold', message = FALSE}
mode_segclust <- segclust(simulmode,
                          Kmax = 20, lmin = 10,
                          ncluster = c(2, 3),
                          seg.var = c("speed", "abs_spatial_angle"))
```

Additionally, it is advised to rescale the variables used to a mean of 0 and a
variance of 1. This is done automatically by `segclust()`, but can also be
specified explicitly:


```{r simulmode segclust scale variable, fig.show='hold', eval = FALSE}
mode_segclust <- segclust(simulmode,
                          Kmax = 20, lmin = 10,
                          ncluster = c(2, 3),
                          seg.var = c("speed", "abs_spatial_angle"),
                          scale.variable = TRUE)
```


### Selecting the number of clusters and segments.


Once the segmentation/clustering has run successfully, a summary like this will appear:
```{r segclust summary, echo = FALSE, results = 'asis'}
  cli::cli_alert_success(
    "Best segmentation/clustering estimated with \\
    {mode_segclust$ncluster.BIC} clusters and \\
    {mode_segclust$Kopt.BIC[mode_segclust$ncluster.BIC]} segments according to BIC")
  cli::cli_text(cli::col_grey(
    '{cli::symbol$arrow_right} Number of clusters should preferentially be selected 
    according to biological knowledge. Exploring the BIC plot with plot_BIC()
    can also provide advice to select the number of clusters.'))
  cli::cli_text(cli::col_grey(
    '{cli::symbol$arrow_right} Once number of clusters is selected, \\
    the number of segments can be selected according to BIC.'))
  
  cli::cli_text(cli::col_grey(
    '{cli::symbol$arrow_right} Results of the segmentation/clustering
    may further be explored with plot() and segmap()'))
  
```

As stated, the number of clusters should preferentially be selected based on
biological knowledge. Exploring the BIC-based penalized log-likelihood with the
function `plot_BIC()` can also help selecting an appropriate number of
clusters. The best-case scenario is as below: the BIC shows a steep increase up
to a maximum and a slow decrease after the optimum, and one number of clusters
is clearly above the others.


```{r plot_BIC, fig.show='hold', message = FALSE}
plot_BIC(mode_segclust)
```

With real data, a larger number of clusters almost always improves the
penalized likelihood, so it will generally be a poor indication of the
appropriate number of clusters. Such a situation can indeed be shown with our
dataset `simulmode`: if we allow the number of clusters to range between 2 and
5 instead of a maximum of 3, we obtain the following BIC curve:

```{r simulmode plot BIC cluster 2-5, fig.show='hold', message = FALSE}
mode_segclust <- segclust(simulmode,
                          Kmax = 20, lmin = 10,
                          ncluster = 2:5,
                          seg.var = c("speed", "abs_spatial_angle"),
                          scale.variable = TRUE)

plot_BIC(mode_segclust)
```

In the curve above we see that the optimal number of clusters selected is 4,
even though the difference is relatively small compared to `ncluster = 3` and
to the evolution with the number of segments. In addition, if we look at the
results of the segmentation we can see an identical segment distribution; the
only difference is that state 2 in the segmentation with 3 clusters is divided
into states 2 and 3 in the segmentation with 4 clusters:

```{r 3 clusters, fig.show='hold', message = FALSE}
plot(mode_segclust, ncluster = 3)
```


```{r 4 clusters, fig.show='hold', message = FALSE}
plot(mode_segclust, ncluster = 4)
```

More generally, as in this example, if the selected number of segments is the
same for a higher number of clusters, then the lower number of clusters should
be favoured.

Once the number of clusters has been selected, the number of segments can be
selected with the BIC criterium. When the segmentation/clustering is reliable,
the selected optimum should be a maximum just before a linear drop of the
penalized log-likelihood, as in the examples above. With real data the signal
is seldom as clear and may require adjustment based on exploration of the
realized segmentation/clustering.
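
The optima reported in the summary are stored in the `segclust()` output. The
sketch below (not evaluated here) uses the same fields as the summary message
shown earlier:

```{r retrieve segclust optima sketch, eval = FALSE}
# Optima selected by BIC, as printed in the summary above
mode_segclust$ncluster.BIC                          # number of clusters selected by BIC
mode_segclust$Kopt.BIC                              # optimal number of segments per ncluster
mode_segclust$Kopt.BIC[mode_segclust$ncluster.BIC]  # segments for the BIC-selected ncluster
```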

