---
title: "Introduction to tidyGenR"
author: "Miguel Camacho-Sanchez"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{tidyGenR-intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r options, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

*tidyGenR* core functions cover all the steps necessary to determine sequence variants and genotypes from sequencing reads coming from **multilocus amplicon** libraries sequenced by high throughput sequencing technologies.

This vignette contains a standard workflow for paired-end data.

```{r setup}
library(tidyGenR)
```

# Input raw sequences

The raw and processed data used in this vignette come from amplicon libraries of *Rattus baluensis* (Camacho-Sanchez et al. 2026).

Naming of FASTQ must meet a given format.

```{r}
# download test raw data (if not downloaded)
raw <-
  system.file("extdata", "raw", package = "tidyGenR")
freads <- list.files(raw,
    pattern = "1.fastq",
    full.names = TRUE)
rreads <- list.files(raw,
    pattern = "2.fastq",
    full.names = TRUE)
```

# Demultiplex loci

Reads from different loci are mixed together in the same sample FASTQ. The known primer sequences at the ends of amplicons can be used to demultiplex reads by locus using `demultiplex()`. This function produces an executable for [cutadapt](https://cutadapt.readthedocs.io/en/stable). A `data.frame` with primer sequences can be used to feed `demultiplex()` with primer sequences.

```{r primers}
# data.frame with primers
data("primers")
head(primers, 3)
```

```{r demultiplex, results = "hide", message=FALSE}
p_sh_out <- tempfile(fileext = ".sh")
# demultiplex reads by locus using 3 primer pairs
demultiplex(
    freads = freads,
    rreads = rreads,
    primers = primers[1:3, ],
    sh_out = p_sh_out,
    run = FALSE
)
```

`demultiplex()` creates a script as below:

```{r cutadapt_script}
readLines(p_sh_out)
```

If problems arise when running `cutadapt` `run = TRUE`, the script written to `sh_out` can be edited manually and run outside R. FASTQ with zero or few reads can be removed using `remove_poor_fastq()`

# Truncate reads

Variants are determined on the basis of 1-nt resolution. Thus, reads to be compared need to have the same length and minimum quality stantards (eg. no 'Ns') (see DADA2 documentation, [Callahan et al. 2016](https://doi.org/10.1038/nmeth.3869)). Sequence quality can be checked with FASTQC, `dada2::plotQualityProfile()` or other QC checking software to guide truncation lengths. Truncation and minimal quality filtering are performed with `trunc_amp()`, a wrapper function of `dada2::filterAndTrim()`. Global or locus-specific truncation lengths can be supplied in a `data.frame`.

```{r truncate_one_locus}
# Download per-locus demultiplexed FASTQ
dem <-
  system.file("extdata", "demultiplexed", package = "tidyGenR")
# truncate reads for one locus
p_trun <- file.path(tempdir(), "truncated")
one_locus <-
    trunc_amp(
        loci = "chrna9",
        mode_trun = "pe",
        in_dir = dem,
        fw_pattern = "1.fastq.gz",
        rv_pattern = "2.fastq.gz",
        trunc_fr = c(250, 180),
        max_ee = c(3, 3),
        outdir = p_trun
    )
```

It is possible to use locus-specific truncation lengths specified in a `data.frame`.

```{r trunc_fr}
# dataframe with locus-specific truncation lengths
data("trunc_fr")

head(trunc_fr, 3)
```

```{r truncate_all_loci, results="hide", message=FALSE}
# truncate reads for all loci using locus-specific truncation lengths
all_loci <-
    trunc_amp(
        mode_trun = "pe",
        in_dir = dem,
        fw_pattern = "1.fastq.gz",
        rv_pattern = "2.fastq.gz",
        trunc_fr = trunc_fr,
        max_ee = c(3, 3),
        outdir = p_trun
    )
```

`trunc_amp()` writes truncated FASTQ to a directory and returns a list with in and out reads.

Truncated FASTQ:

```{r truncated_fastq}
head(list.files(p_trun))
```

IN and OUT reads after `trunc_amp()`:

```{r trunc_in_out}
head(all_loci, 3)
```

# Variant calling

Real sequence variants and their frequency are determined with `variant_call()`. It utilizes the 'DADA2' ASV determination pipeline, but for multiple loci. Determination of variants and potentially, merging paired-end reads, and removal of chimeras are performed altogether with `variant_call()`. Binned read qualities (i.e. from NovaSeq) require the use of `error_function = loess_err_mod4` (details in `?loess_err_mod4()`). Variants are returned in `tidy` format.

```{r variant_calling, message=FALSE, results="hide"}
variants <- variant_call(in_dir = p_trun)
head(variants)
```
```{r head_variants}
head(variants)
```

Variants can be further filtered with `filter_variants()` based on frequency (`maf`) and depth (`ad`) thresholds.

# Genotype

Genotypes can be determined from variants. So far, the implemented method for genotyping is based on the expected maximum number of alleles per sample and locus and the read threshold (`ADt`) for discriminating homozygotes from hemizygotes.

```{r genotype, message = FALSE}
genotypes <-
    genotype(variants, ADt = 10, ploidy = 2)
head(genotypes)
```
