---
title: "Getting Started with glyrepr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with glyrepr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

R gives us many data structures for different jobs.
For example, `c(1L, 2L, 3L)` represents an integer vector,
`c("a", "b", "c")` represents a character vector,
and data frames represent tabular data.
Before we can inspect or manipulate a kind of data well,
we first need a suitable structure for representing it.

What about glycans?

When we talk about glycans, we usually mean two related things:
their compositions and their structures.
A composition can be represented as a named vector,
for example `c(Hex = 3, HexNAc = 2, Neu5Ac = 1)`.
A structure can be represented as a graph,
for example with the `igraph` package.
Those representations are useful, but managing the details by hand quickly becomes cumbersome,
especially for complex glycans or large datasets.
This is where the `glyrepr` package comes in.
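To make "by hand" concrete, here is a minimal base R sketch of the do-it-yourself approach for compositions.
The helper `count_residue()` is hypothetical and not part of `glyrepr`;
it only illustrates the bookkeeping you would otherwise have to write yourself.

```{r}
# Hand-rolled compositions: a plain list of named numeric vectors.
manual_comps <- list(
  c(Hex = 3, HexNAc = 2, Neu5Ac = 1),
  c(Hex = 5, HexNAc = 2)
)

# Hypothetical helper: counting one residue means handling
# absent names ourselves.
count_residue <- function(comp, mono) {
  if (mono %in% names(comp)) comp[[mono]] else 0
}

manual_counts <- vapply(manual_comps, count_residue, numeric(1), mono = "Neu5Ac")
manual_counts
```

Every further operation (validation, sorting, printing, deduplication) would need similar custom code.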

`glyrepr` provides two main vector types:
`glycan_composition()` and `glycan_structure()`.
These vectors are designed to feel like ordinary R vectors:
you can subset them, concatenate them, sort them, put them in tibbles,
and use them in vectorized workflows.

These two representations are the foundation of `glycoverse`.
Most higher-level packages in the ecosystem build on them,
so it is worth taking a little time to get comfortable here.

```{r setup}
library(glyrepr)
```

## Glycan Composition Vectors

Let's start with glycan composition vectors.

### Creating Glycan Composition Vectors

A glycan composition vector can be created with either `glycan_composition()` or `as_glycan_composition()`.

`glycan_composition()` is the direct constructor.
It takes one or more named vectors, where the names are monosaccharides and the values are counts.

```{r}
comps <- glycan_composition(
  c(Man = 5, GlcNAc = 2),
  c(Man = 3, Gal = 2, GlcNAc = 4),
  c(Man = 3, Gal = 2, GlcNAc = 4, Neu5Ac = 1, Fuc = 1)
)
comps
```

This creates a glycan composition vector called `comps` with three glycan compositions.

`as_glycan_composition()` is more flexible and can convert several input types.

From a list of named vectors:

```{r}
as_glycan_composition(list(
  c(Man = 5, GlcNAc = 2),
  c(Man = 3, Gal = 2, GlcNAc = 4),
  c(Man = 3, Gal = 2, GlcNAc = 4, Neu5Ac = 1, Fuc = 1)
))
```

This looks similar to `glycan_composition()`,
but it is more convenient when your compositions are already stored in a list,
for example after reading or generating them programmatically.

```{r}
comp_list <- list(
  c(Man = 5, GlcNAc = 2),
  c(Man = 3, Gal = 2, GlcNAc = 4),
  c(Man = 3, Gal = 2, GlcNAc = 4, Neu5Ac = 1, Fuc = 1)
)
as_glycan_composition(comp_list)
```

From a character vector:

```{r}
as_glycan_composition(c("H5N2", "Hex(3)HexNAc(2)"))
```

Both compact notation (`"H5N2"`) and Byonic-style notation (`"Hex(3)HexNAc(2)"`) are supported.
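To give a feel for what the Byonic-style notation encodes, the following base R sketch splits such a string into a named count vector.
`parse_byonic_style()` is a hypothetical helper written for this illustration;
it is not how `glyrepr` parses strings, and it does no validation.

```{r}
# Hypothetical helper, for illustration only.
parse_byonic_style <- function(x) {
  # Find every "Name(count)" token, e.g. "Hex(3)".
  tokens <- regmatches(x, gregexpr("[A-Za-z0-9]+\\([0-9]+\\)", x))[[1]]
  # Split each token into its residue name and its count.
  residues <- sub("\\(.*$", "", tokens)
  counts <- as.integer(sub("^.*\\(([0-9]+)\\)$", "\\1", tokens))
  stats::setNames(counts, residues)
}

parsed_comp <- parse_byonic_style("Hex(3)HexNAc(2)")
parsed_comp
```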

From a glycan structure vector, which we will discuss below:

```{r}
strucs <- c(o_glycan_core_1(), o_glycan_core_2())
as_glycan_composition(strucs)
```

### Two Types of Monosaccharide Names

Before we move on, let's briefly discuss the two types of monosaccharide names used in `glyrepr`:

1. **Generic names**: Hex, HexNAc, dHex, etc.
2. **Specific names**: Man, GlcNAc, Fuc, etc.

Generic names are common in mass spectrometry data,
where it is often difficult to distinguish isomers from MS evidence alone.
For example, one `Hex` residue could be Man, Gal, Glc, or another hexose.
Specific names carry more biological detail and are common in glycan databases and literature.
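The relationship between the two naming levels can be pictured as a many-to-one lookup.
The sketch below uses a small, hypothetical table (`generic_of`) that covers only a few residues;
it is an illustration in base R, not the mapping that `glyrepr` uses internally.

```{r}
# Hypothetical lookup table covering only a handful of residues.
generic_of <- c(
  Man = "Hex", Gal = "Hex", Glc = "Hex",
  GlcNAc = "HexNAc", GalNAc = "HexNAc",
  Fuc = "dHex"
)

# Hypothetical helper: collapse a specific composition to generic classes.
# Residues missing from `generic_of` are silently dropped in this sketch.
to_generic <- function(comp) {
  # Sum the counts of all specific residues that share a generic class.
  totals <- tapply(comp, generic_of[names(comp)], sum)
  stats::setNames(as.vector(totals), names(totals))
}

generic_comp <- to_generic(c(Man = 3, Gal = 2, GlcNAc = 4, Fuc = 1))
generic_comp
```

Going the other way is not possible: once residues are collapsed to `Hex`, the information about which hexose they were is gone.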

`glyrepr` supports both types of names, but you cannot mix them in the same vector.

```{r}
# Mixing generic and specific names raises an error.
# try() lets the vignette keep building; silent = TRUE hides the error message.
try(as_glycan_composition(c("Hex(5)HexNAc(2)", "Man(5)GlcNAc(2)")), silent = TRUE)
```

The same rule applies to glycan structure vectors.

### Inspecting Glycan Composition Vectors

The main function for inspecting glycan compositions is `count_mono()`.
Let's demonstrate it using the `comps` vector we created earlier.

```{r}
comps
```

The first argument is a glycan composition vector, and the second argument is the monosaccharide you want to count.

```{r}
count_mono(comps, "Man")
```

```{r}
count_mono(comps, "Neu5Ac")
```

`count_mono()` works with both generic and specific monosaccharide names.
It understands the relationship between them, so it can count at the level you ask for.

```{r}
count_mono(comps, "Hex")
```

Note that both "Man" and "Gal" are counted as "Hex" here.

You can also omit the second argument to get the total monosaccharide count for each composition.

```{r}
count_mono(comps)
```
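Because `count_mono()` returns one count per composition, its results can be combined with ordinary arithmetic.
As a small sketch, assuming the counts behave like a regular numeric vector,
the fraction of hexoses in each composition could be computed like this
(`example_comps` and `hex_fraction` are names chosen for this example):

```{r}
library(glyrepr)

example_comps <- glycan_composition(
  c(Man = 5, GlcNAc = 2),
  c(Man = 3, Gal = 2, GlcNAc = 4)
)

# Hexose count divided by the total residue count, element by element.
hex_fraction <- count_mono(example_comps, "Hex") / count_mono(example_comps)
hex_fraction
```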

### Manipulating Glycan Composition Vectors

One useful mental model for glycan composition vectors
(and glycan structure vectors) is that they behave like atomic vectors.
This means they support familiar operations like subsetting, concatenation, and sorting.

**Concatenation**:

```{r}
c(comps, comps)
```

**Subsetting**:

```{r}
comps[1:2]
```

```{r}
comps[integer()]
```

**Length**:

```{r}
length(comps)
```

**Unique values**:

```{r}
dup_comps <- c(comps, comps)
dup_comps
```

```{r}
unique(dup_comps)
```

**Repetition**:

```{r}
rep(comps, times = 2)
```

**Sorting**:

```{r}
sort(comps)
```

```{r}
sort(comps, decreasing = TRUE)
```

### Working with Tibbles

One of the most useful features of glycan composition vectors is that they work smoothly with tibbles and data frames.

```{r}
library(tibble)

tb <- tibble(
  id = c("glycan1", "glycan2", "glycan3"),
  composition = comps
)
tb
```

You can use `tidyverse` functions to perform operations on the glycan composition column.

```{r}
library(dplyr)

tb |>
  mutate(n_sia = count_mono(composition, "Neu5Ac")) |>
  filter(n_sia > 0)
```

### Missing Values and Names

Glycan composition vectors can also handle missing values.

```{r}
comps_with_na <- glycan_composition(c(Man = 5, GlcNAc = 2), NA)
comps_with_na
```

```{r}
count_mono(comps_with_na, "Man")
```

For technical reasons, glycan composition vectors cannot have element names right now.
In practice, this is usually fine:
the composition itself remains the data value,
and identifiers can live in a separate tibble column.
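As a minimal sketch of that pattern, the identifiers sit in an `id` column and lookups use ordinary logical subsetting.
The table name `lookup_tb` and the identifiers are made up for this example.

```{r}
library(glyrepr)
library(tibble)

# Identifiers live in their own column; the composition column stays unnamed.
lookup_tb <- tibble(
  id = c("glycan1", "glycan2"),
  composition = glycan_composition(
    c(Man = 5, GlcNAc = 2),
    c(Man = 3, Gal = 2, GlcNAc = 4)
  )
)

# Look up a composition by identifier.
picked <- lookup_tb$composition[lookup_tb$id == "glycan2"]
picked
```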

## Glycan Structure Vectors

Now let's move on to the core of glycan representation:
glycan structure vectors.
Many of the same ideas apply,
including atomic-vector-like behavior, vectorized operations, and smooth integration with tibbles.
Because structures are more complex than compositions,
there are a few additional features and considerations to keep in mind.

### Creating Glycan Structure Vectors

As with glycan composition vectors,
you can create glycan structure vectors with either `glycan_structure()` or `as_glycan_structure()`.

Under the hood, glycan structures are represented as `igraph` objects,
because glycans are naturally graph-like.
Most of the time, you do not need to work with those graphs directly:
`glyrepr` gives you a higher-level interface for common structure operations.
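To illustrate why a graph is a natural fit, here is a toy graph for the core 2 O-glycan built directly with `igraph`.
This is only a sketch: the attribute names `residue` and `linkage` and the edge direction are assumptions made for this example,
and they do not necessarily match the graph model that `glyrepr` uses.

```{r}
# One row per residue; "1" is the reducing-end GalNAc.
vertices <- data.frame(
  name = c("1", "2", "3"),
  residue = c("GalNAc", "Gal", "GlcNAc")
)

# One row per glycosidic bond, pointing away from the reducing end
# (an arbitrary choice for this sketch).
edges <- data.frame(
  from = c("1", "1"),
  to = c("2", "3"),
  linkage = c("b1-3", "b1-6")
)

toy_graph <- igraph::graph_from_data_frame(edges, directed = TRUE, vertices = vertices)

igraph::vcount(toy_graph)
igraph::ecount(toy_graph)
igraph::E(toy_graph)$linkage
```

Residues map to vertices and glycosidic bonds map to edges, so branching comes for free.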

`glycan_structure()` is the direct constructor and takes one or more `igraph` objects.
Creating those graphs manually can be tedious,
so you will usually start with `as_glycan_structure()` instead.
It can parse IUPAC-condensed strings into glycan structure vectors.

If you are not familiar with IUPAC-condensed notation,
check out this [article](https://glycoverse.github.io/glyrepr/articles/iupac.html) for a quick introduction.
We recommend getting familiar with this notation,
because it is the main text language for communicating glycan structures in `glycoverse`.

Parsing IUPAC-condensed strings is straightforward with `as_glycan_structure()`.
(For parsing other formats like GlycoCT, use the [glyparse](https://github.com/glycoverse/glyparse) package.)

```{r}
strucs <- as_glycan_structure(c(
  "Gal(b1-3)GalNAc(a1-",
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"
))
strucs
```
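To get a rough feel for how much information an IUPAC-condensed string carries,
the base R sketch below pulls out only the residue names and tallies them.
It is a deliberately crude illustration, not a parser:
it ignores linkages and branching, and it would treat a substituted residue such as `Gal6S` as a single name.

```{r}
iupac <- "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"

# Grab every run of letters and digits that is directly followed by "(".
# Linkage text such as "b1-3" is skipped because it is never followed by "(".
residues <- regmatches(
  iupac,
  gregexpr("[A-Za-z0-9]+(?=\\()", iupac, perl = TRUE)
)[[1]]
residues

# Tally the residues, which gives the composition implied by the structure.
table(residues)
```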

### Inspecting Glycan Structure Vectors

`strucs` prints a bit like a character vector with colors,
but under the hood the glycan structures are stored as `igraph` objects.
You can access the underlying `igraph` objects using `get_structure_graphs()`.

```{r}
get_structure_graphs(strucs)
```

Details of how a glycan structure is modeled as a graph are covered in the [glycan graph vignette](https://glycoverse.github.io/glyrepr/articles/glycan-graph.html).
You do not need all of those details for everyday use,
but a quick skim can make the structure functions easier to understand.

The `count_mono()` function also works with glycan structure vectors.

```{r}
count_mono(strucs, "Gal")
```

Other functions, including `has_linkages()`, `get_mono_type()`, and `get_structure_level()`,
inspect specific aspects of the structures.

```{r}
# This function works element-wise
has_linkages(strucs)
```

```{r}
get_mono_type(strucs)
```

```{r}
get_structure_level(strucs)
```

### Structure Levels

The structures we have seen so far are "intact" structures.
They contain specific monosaccharides and complete linkage information.
In real datasets, though, glycan structures often have missing information.
For example, we might know the topology but not the linkages,
or we might only know generic monosaccharide classes.

To accommodate these scenarios, `glyrepr` defines four levels of glycan structures:

- **Intact**: specific monosaccharides and complete linkage information.
- **Partial**: specific monosaccharides, with at least one missing linkage annotation.
- **Topological**: specific monosaccharides, but all linkage information is missing.
- **Basic**: generic monosaccharides, with linkage information treated as missing.

Structure levels are defined at the vector level,
so one glycan structure vector has one level.
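The four definitions above can be read as a small decision rule.
The sketch below encodes that rule for a single structure in base R.
`classify_level()` is a hypothetical helper, not a `glyrepr` function:
it looks only at residue-to-residue linkage strings, ignores the reducing-end anomer,
and does not model how `glyrepr` settles on one level for a whole vector.

```{r}
# Hypothetical helper, not part of glyrepr. `linkages` holds the
# residue-to-residue linkage strings of one structure (at least one),
# and `generic` says whether the residue names are generic.
classify_level <- function(linkages, generic = FALSE) {
  if (generic) {
    return("basic")
  }
  fully_unknown <- linkages == "??-?"
  partly_unknown <- grepl("?", linkages, fixed = TRUE)
  if (all(fully_unknown)) {
    "topological"
  } else if (any(partly_unknown)) {
    "partial"
  } else {
    "intact"
  }
}

classify_level(c("b1-3", "b1-6"))
classify_level(c("b1-?", "b1-6"))
classify_level(c("??-?", "??-?"))
classify_level(c("??-?", "??-?"), generic = TRUE)
```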

Let's see some examples.

**Intact structures**:

```{r}
as_glycan_structure(c(
  "Gal(b1-3)GalNAc(a1-",
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"
)) |> get_structure_level()
```

**Partial structures**:

```{r}
as_glycan_structure(c(
  "Gal(b1-?)GalNAc(a1-",
  "Gal(b1-?)[GlcNAc(b1-6)]GalNAc(a1-"
)) |> get_structure_level()
```

**Topological structures**:

```{r}
as_glycan_structure(c(
  "Gal(??-?)GalNAc(??-",
  "Gal(??-?)[GlcNAc(??-?)]GalNAc(??-"
)) |> get_structure_level()
```

**Basic structures**:

```{r}
as_glycan_structure(c(
  "Hex(??-?)HexNAc(??-",
  "Hex(??-?)[HexNAc(??-?)]HexNAc(??-"
)) |> get_structure_level()
```

In theory, you can have something like `"Hex(b1-3)HexNAc(a1-"`,
with generic monosaccharides but all linkages intact.
In practice, linkage information is usually harder to obtain than monosaccharide identities,
so this situation is rare.

If you create such a vector, `glyrepr` classifies it as `"basic"` and warns you.

```{r}
as_glycan_structure(c(
  "Hex(a1-3)HexNAc(a1-",
  "Hex(a1-3)[HexNAc(b1-6)]HexNAc(a1-"
)) |> get_structure_level()
```

### Manipulating Glycan Structure Vectors

Like composition vectors, glycan structure vectors support vectorized operations such as subsetting and concatenation.
In addition, they have structure-specific helpers for reducing resolution,
removing linkages, and removing substituents.

```{r}
strucs
```

```{r}
reduce_structure_level(strucs, to_level = "basic")
```

```{r}
remove_linkages(strucs)  # same as reduce_structure_level(strucs, to_level = "topological")
```
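At the level of the text notation, removing linkages amounts to replacing every linkage annotation with unknowns.
The following base R sketch does that on a plain IUPAC-condensed string.
`strip_linkages()` is a hypothetical helper for illustration only;
`remove_linkages()` works on the structure vector itself, not on strings.

```{r}
# Hypothetical helper, for illustration only.
strip_linkages <- function(x) {
  # Replace each closed linkage such as "(b1-3)" with "(??-?)".
  x <- gsub("\\([^()]*-[^()]*\\)", "(??-?)", x)
  # Replace the open reducing-end annotation such as "(a1-" with "(??-".
  sub("\\([^()]*-$", "(??-", x)
}

stripped <- strip_linkages("Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-")
stripped
```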

```{r}
strucs_with_subs <- as_glycan_structure(c(
  "Gal6S(b1-3)GalNAc(a1-",
  "Gal6S(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"
))
remove_substituents(strucs_with_subs)
```

### Missing Values and Names

Like glycan composition vectors, glycan structure vectors can also handle missing values.

Unlike glycan composition vectors, though, glycan structure vectors can have names.
This feature was introduced in version 0.10.0.

```{r}
strings <- c(
  glycan1 = "Gal(b1-3)GalNAc(a1-",
  glycan2 = "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"
)
as_glycan_structure(strings)
```

You will find the names useful when working with the `glymotif` package.

## What's Next?

This vignette sets the foundation for working with glycan representations in `glycoverse`.
From here, you can explore the packages that build on these representations:

- [glyparse](https://github.com/glycoverse/glyparse): parsing text nomenclatures into glycan representations.
- [glymotif](https://github.com/glycoverse/glymotif): identifying and analyzing glycan motifs.
- [glydet](https://github.com/glycoverse/glydet): calculating derived traits and quantifying motifs.
- [glydraw](https://github.com/glycoverse/glydraw): visualizing glycan structures with SNFG notation.
- [glyenzy](https://github.com/glycoverse/glyenzy): inspecting and modeling glycan biosynthesis.

Happy exploring.