---
title: "Glycan Graphs: The Network Behind Glycan Structures"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Glycan Graphs: The Network Behind Glycan Structures}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette is for users who want to understand how `glyrepr` stores glycan structures internally.
Some familiarity with graph theory and the `igraph` package will help.
If those concepts are new to you,
the [igraph documentation](https://r.igraph.org) is a useful companion reference.

## Glycans as Graphs

Glycans are naturally represented as directed graphs.
In `glyrepr`, a glycan structure is stored as an outward-directed tree,
where each vertex represents a monosaccharide and each edge represents a glycosidic linkage.

Behind the scenes,
each `glycan_structure()` object is backed by an `igraph` object.
Most workflows can stay at the `glycan_structure()` level,
but the graph representation is useful when you need custom structure analysis.

```{r setup}
library(glyrepr)
```

## What Is Stored in Memory?

A glycan can carry many kinds of information:
linear oriented C-atoms,
basetype,
substituents,
configuration,
anomeric center,
ring size,
linkage positions,
and more.

Some packages, such as Python's `glypy`, store a very detailed glycan model.
That comprehensive strategy is useful for specialized tasks such as MS/MS spectra simulation,
but it can be overkill for everyday omics research.

`glyrepr` takes a more compact approach:
if a feature can be derived from an IUPAC-condensed text representation,
`glyrepr` stores it.
Details such as configuration and ring size are not stored directly,
because they are often predictable for common carbohydrates and are not needed for many glycomics workflows.

For a closer look at IUPAC-condensed notation,
see the [IUPAC-condensed vignette](https://glycoverse.github.io/glyrepr/articles/iupac.html).

## Extracting the Graph

You cannot pass a `glycan_structure()` object directly to most `igraph` functions.
First, extract the underlying graph with `get_structure_graphs()`:

```{r}
glycan <- n_glycan_core()
graph <- get_structure_graphs(glycan)
graph
```

The printed graph contains several pieces of information.

**First line:** Directed Named ("DN") graph with 5 vertices (sugar units) and 4 edges (bonds). 

**Graph-level attributes:**

- `anomer`: the anomeric configuration of the reducing end.

**Vertex attributes:**

- `name`: a unique ID for each monosaccharide.
- `mono`: the monosaccharide type, such as "Hex" or "HexNAc".
- `sub`: chemical decorations attached to the monosaccharide.

**Edge attributes:**

- `linkage`: how the monosaccharides are connected, including bond positions and configurations.

**Connection pattern:** "1->2" means vertex 1 connects to vertex 2. 
`glyrepr` treats bonds as arrows pointing from the core toward the branches.
The direction is a modeling choice that makes traversal and structure operations easier.

You can also plot the graph with `igraph`:

```{r}
plot(graph)
```

## Graph Components

### Vertices

Each vertex represents a monosaccharide with three key properties:

**Names:**
These are auto-generated identifiers, usually simple integers,
but they could be anything as long as they're unique:

```{r}
igraph::V(graph)$name
```

**Monosaccharides:**
These are IUPAC-condensed names like "Hex", 
"HexNAc", 
"Glc", 
"GlcNAc". 

```{r}
igraph::V(graph)$mono
```

For the full list of available monosaccharides,
check [SNFG notation](https://www.ncbi.nlm.nih.gov/glycans/snfg.html) or run `available_monosaccharides()`.

**Substituents:**
Chemical decorations like "Me" (methyl), 
"Ac" (acetyl), 
"S" (sulfate), 
etc. 
Position matters.
"3Me" = methyl at position 3, 
"?S" = sulfate at unknown position:

```{r}
igraph::V(graph)$sub
```

Multiple decorations are comma-separated and sorted by position:

```{r}
glycan2 <- as_glycan_structure("Glc3Me6S(a1-")
graph2 <- get_structure_graphs(glycan2)
igraph::V(graph2)$sub
```

### Edges

Edges represent glycosidic bonds with a simple but powerful format:

```
<target anomeric config><target position> - <source position>
```

Here is an example where "Gal" has an "a" anomeric configuration,
linking from position 3 of "GalNAc" to position 1 of "Gal":

```{r}
glycan3 <- as_glycan_structure("Gal(a1-3)GalNAc(b1-")
graph3 <- get_structure_graphs(glycan3)
igraph::E(graph3)$linkage
```

`glyrepr` stores anomer information in edges rather than vertices.
This follows the way linkages are written in IUPAC-condensed notation,
for example "Neu5Ac with an a2-3 linkage".

### Graph-Level Attributes

**Anomer:** The anomeric configuration of the reducing end.

```{r}
graph$anomer
```

## Working with the Graph

### Using `igraph`

Once you understand the graph structure, 
you can use `igraph` functions for custom structure analysis.

**Example 1:** Count branched structures (sugars with multiple children):

```{r}
sum(igraph::degree(graph, mode = "out") > 1)
```

**Example 2:** Explore the structure with breadth-first search:

```{r}
bfs_result <- igraph::bfs(graph, root = 1, mode = "out")
bfs_result$order
```

### Using `smap` Functions

Working with multiple glycans? You could use `purrr`:

```{r}
library(purrr)

glycans <- c(n_glycan_core(), o_glycan_core_1(), o_glycan_core_2())
graphs <- get_structure_graphs(glycans)  # Extract graphs first
map_int(graphs, ~ igraph::vcount(.x))    # Then analyze
```

For glycan structure vectors,
`glyrepr`'s `smap` functions are usually a better fit:

```{r}
smap_int(glycans, ~ igraph::vcount(.x))  # Direct analysis, no intermediate step.
```

The main advantage of `smap` functions is how they handle duplicates.
Real datasets often contain many repeated structures,
and `smap` optimizes by processing unique structures once, 
then efficiently expanding results back to the original dimensions.

The [smap vignette](https://glycoverse.github.io/glyrepr/articles/smap.html) covers this workflow in more detail.

### Motif Analysis with `glymotif`

One important application is identifying biologically meaningful motifs, or functional substructures.
The `glymotif` package, 
built on this graph foundation, 
specializes in exactly this task.

See the [`glymotif` introduction](https://glycoverse.github.io/glymotif/articles/glymotif.html) for examples.

## Summary

In this vignette, you saw:

- how glycan structures map to directed graphs.
- what information is stored and what is deliberately omitted.
- how to extract and inspect the underlying graphs.
- how `igraph`, `smap`, and `glymotif` can build on this representation.

The graph representation might seem complex at first, 
but it's this solid foundation that enables all the sophisticated glycan analysis capabilities in the `glycoverse`. 
Most users will not need to manipulate the graph directly,
but understanding the model makes it easier to extend `glyrepr` when custom analysis is needed.
