---
title: "Using grouped sequence data with tna"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using grouped sequence data with tna}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 6,
  fig.height = 4,
  out.width = "100%",
  dev = "jpeg",
  dpi = 100,
  comment = "#>"
)
suppressPackageStartupMessages({
  library("tna")
  library("tibble")
  library("dplyr")
  library("gt")
})
```

TNA supports the analysis of transition networks constructed from grouped sequence data. Groups can be defined in several ways, but mainly: using a pre-existing grouping variable in the data (e.g., a demographic or experimental condition), or by clustering the sequences themselves based on their similarity. This vignette demonstrates both approaches using the `group_regulation_long` dataset.

First, we load the packages we will use for this example.

```{r, eval = FALSE}
library("tna")
library("tibble")
library("dplyr")
library("gt")
```

## Data preparation

We import the data in long format and prepare it for analysis. The `prepare_data()` function converts the long-format event log into wide-format sequences. The unused columns in the dataset are stored in the metadata of `prepared` and we can use them later on.

```{r, message = F}
data("group_regulation_long", package = "tna")
prepared <- prepare_data(group_regulation_long,
                         actor = "Actor",
                         action = "Action",
                         time = "Time")
```

## Groups from a pre-existing variable

When the data contains a grouping variable, we can build separate TNA models for each group directly. Here, the `"Achiever"` column in the metadata splits the sequences into two groups (high vs. low achievers).

```{r, fig.width=9, fig.height=4}
layout(t(1:2))
achievers <- group_tna(prepared, group = "Achiever")
plot(achievers)
```

### Comparing groups

The `plot_compare()` function visualizes the difference network between the two groups. Green edges and donut segments indicate that the first group (High achievers) has higher values, while red indicates the opposite (Low achievers have higher values).

```{r}
plot_compare(achievers)
```

### Permutation test

A permutation test can be used to assess whether the observed differences between the two groups are statistically significant.

```{r}
permutation_test_results <- permutation_test(achievers)
plot(permutation_test_results)
```

### Subsequence comparison

We can also compare the frequency of subsequences across groups. Here we look at subsequences of length 3 to 5, keeping only those that appear at least 5 times, and apply FDR correction for multiple comparisons.

```{r}
subsequence_comparison  <- compare_sequences(achievers,
                                                  sub = 3:5,
                                                  min_freq = 5,
                                                  correction = "fdr")
plot(subsequence_comparison, cells = TRUE)
```

## Groups from sequence clustering

When no pre-existing grouping variable is available, we can cluster the sequences based on their pairwise dissimilarity. The `cluster_sequences()`.

```{r}
clustering_results <- cluster_sequences(prepared, k = 3)
```

To choose an appropriate number of clusters, we can plot the silhouette score for different values of *k*. Higher silhouette values indicate better-separated clusters.

```{r}
plot(
  2:8,
  sapply(2:8, \(k) cluster_sequences(prepared, k = k)$silhouette),
  type = "b",
  xlab = "Number of clusters (k)",
  ylab = "Silhouette",
  xaxt = "n"
)
```

Once we have chosen *k*, we build the grouped TNA model using the cluster assignments.

```{r}
tna_model_clus <- group_tna(prepared, group = clustering_results$assignments)
```

```{r, fig.width=9, fig.height=9}
layout(matrix(1:4, byrow = T, ncol = 2))
plot(tna_model_clus)
```

### Summarizing the cluster-specific models

We can summarize the cluster-specific models to compare their overall characteristics.

```{r}
summary(tna_model_clus) |>
  gt() |>
  fmt_number(decimals = 2)
```

Initial probabilities show which states are most common at the start of the sequences in each cluster.

```{r}
mat <- sapply(
  tna_model_clus,
  \(x) setNames(x$inits, x$labels)
)

df <- data.frame(label = rownames(mat), mat, row.names = NULL)

gt(df, rowname_col = "label") |> fmt_percent(columns = -label)
```

The full transition probability matrices can also be inspected for each cluster.

```{r}
transitions <- lapply(
  tna_model_clus,
  function(x) {
    x$weights |>
      data.frame() |>
      rownames_to_column("From\\To") |>
      gt() |>
      fmt_percent()
  }
)

transitions[[1]] |> tab_header(title = names(tna_model_clus)[1])
transitions[[2]] |> tab_header(title = names(tna_model_clus)[2])
transitions[[3]] |> tab_header(title = names(tna_model_clus)[3])
```

### Pruning with bootstrap

Just like ordinary TNA models, we can retain only the statistically robust edges.

```{r}
cluster_boot <- bootstrap(tna_model_clus)
```

```{r, fig.width=9, fig.height=9, message = F}
layout(matrix(1:4, byrow = T, ncol = 2))
plot(cluster_boot)
```

### Centrality measures

Centrality measures can be computed for each cluster to identify which states play central roles in each group's transition dynamics.

```{r, fig.width=9, fig.height=4}
centrality_measures <- c(
  "BetweennessRSP",
  "Closeness",
  "InStrength",
  "OutStrength"
)
centralities_per_cluster <- centralities(
  tna_model_clus,
  measures = centrality_measures
)
plot(
  centralities_per_cluster, ncol = 4,
  colors = c("purple", "orange", "pink")
)
```

### Subsequence comparison across clusters

Finally, we can compare subsequence frequencies across the clusters, just as we did for the pre-existing groups above.

```{r}
subsequence_comparison  <- compare_sequences(tna_model_clus, sub = 3:5, min_freq = 5, correction = "fdr")
plot(subsequence_comparison, cells = TRUE)
```

