---
title: "Introduction to scPloidy"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

has_pkg = requireNamespace("GenomicRanges", quietly = TRUE) &&
  requireNamespace("IRanges", quietly = TRUE) &&
  requireNamespace("readr", quietly = TRUE)
knitr::opts_chunk$set(eval = has_pkg)
```

## Introduction

`scPloidy` is a package to compute ploidy of single cells (or nuclei) based on
single-cell (or single-nucleus) ATAC-seq data.
In ATAC-seq, open chromatin regions are excised and sequenced.
For any site on the genome, ATAC-seq could read 0, 1 or 2 DNA fragments,
if the cell was diploid.
If the cell was tetraploid, ATAC-seq could read 0, 1, 2, 3 or 4 fragments from the same site.
This is the basic idea used in `scPloidy`.
We model the depth of DNA sequencing at one site by binomial distribution.

## Usage

```{r setup}
library(scPloidy)
library(GenomicRanges)
library(IRanges)
library(readr)
```

See description.

```{r}
?fragmentoverlapcount
?ploidy
```

## Analyzing sample data

In this section, we demonstrate the package by using a dataset included in the package.
See next section on how to analyze your own data.

### Compute overlaps of DNA fragments

We use small dataset for single-nucleus ATAC-seq of rat liver.
To limit the file size, it only includes 10 cells and fragments from chromosomes 19 and 20.
The fragments were mapped to the rat rn7 genome.

We first set the regions where overlaps are counted.
This is usually all of the autosomes.

```{r}
targetregions =
  GenomicRanges::GRanges(
    c("chr19", "chr20"),
    IRanges::IRanges(c(1, 1), width = 500000000))
```

Simple repeats in the genome can generate false overlaps.
We exclude such regions.
The regions were downloaded from USCS genome browser.

```{r}
simpleRepeat = readr::read_tsv(
  system.file("extdata", "simpleRepeat.chr19_20.txt.gz", package = "scPloidy"),
  col_names = c("chrom", "chromStart", "chromEnd"))
simpleRepeat[, 2] = simpleRepeat[, 2] + 1 # convert from 0-based position to 1-based
simpleRepeat = GenomicRanges::makeGRangesFromDataFrame(
  as.data.frame(simpleRepeat),
  seqnames.field = "chrom",
  start.field    = "chromStart",
  end.field      = "chromEnd")
```

Now compute the overlaps.
The input file `SHR_m154211.10cells.chr19_20.fragments.txt.gz`
records the fragments sequenced in ATAC-seq.

```{r}
fragmentoverlap =
  fragmentoverlapcount(
    system.file("extdata", "SHR_m154211.10cells.chr19_20.fragments.txt.gz", package = "scPloidy"),
    targetregions,
    excluderegions = simpleRepeat,
    Tn5offset = c(4, -5))
fragmentoverlap
rm(fragmentoverlap)
```

### Infer ploidy

Instead of the small dataset computed above,
we here use a complete dataset for single-nucleus ATAC-seq of a rat liver.

```{r}
data(SHR_m154211)
?SHR_m154211
fragmentoverlap = SHR_m154211$fragmentoverlap
fragmentoverlap
```

Compute ploidy, assuming a mixture of 2x (diploid), 4x (tetraploid) and 8x cells.
We recommend using `ploidy.moment` which is based on moment method.

```{r}
p = ploidy(fragmentoverlap,
           c(2, 4, 8))
head(p)
```

The cell type of the cells are stored in dataframe `cells`.
There are many hepatocytes of 4x and 8x, but other cell types are mostly 2x.

```{r}
cells = SHR_m154211$cells
table(cells$celltype,
      p$ploidy.moment[match(cells$barcode, p$barcode)])
```

## Analyzing your own data

### Using fragments.tsv.gz generated by 10x Cell Ranger

In the Cell Ranger output directory, you should have files
`fragments.tsv.gz` and `fragments.tsv.gz.tbi`.
The file `fragments.tsv.gz` can be processed with the `fragmentoverlapcount` function,
specifying the option `Tn5offset = c(1, 0)`

### Using BAM file

Although this requires several steps, you can start from a BAM file
and generate fragments file for `scPloidy`.
You need to install `samtools`, `bgzip` and `tabix`,
and run the following in your shell.

First generate a BAM file in which the cell barcode is prepended to read name,
separated by ':'.
For example, suppose in your input file `bap.bam`
your barcode `BCxxxxxx` was stored in the field with `DB` tag as
`DB:Z:atac_possorted_bam_BCxxxxxx`.
We generate a BAM file `snap.bam`.

```{bash, eval = FALSE}
# extract the header file
samtools view bap.bam -H > header.sam

# create a bam file with the barcode embedded into the read name
cat <( cat header.sam ) \
<( samtools view bap.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^DB:Z:atac_possorted_bam_BC"){ td[substr($i,1,2)] = substr($i,25,length($i)-24); } }; printf "%s:%s\n", td["DB"], $0 }' ) \
| samtools view -bS - > snap.bam
```

We next sort the reads by barcode,
and obtain the file `snap.nsrt.bam`.

```{bash, eval = FALSE}
samtools sort -n -@ 20 -m 10G snap.bam -o snap.nsrt.bam
```

Finally, we extract fragments from the name-sorted BAM file,
and obtain the file `fragments.txt.gz`.

```{bash, eval = FALSE}
samtools view -f 0x2 -F 0x900 -q 30 snap.nsrt.bam  \
  | samtofragmentbed.pl  \
  | perl -ne 'chomp; @a=split; $a[3]=~s/:.*//; print join("\t",@a), "\n"'  \
  | LC_ALL=C sort -T ./ -S 50% --parallel=12 -k1,1 -k2,3n -k4,4 -t$'\t' | uniq  \
  | bgzip > fragments.txt.gz
tabix -b 2 -e 3 fragments.txt.gz
```

The Perl script `samtofragmentbed.pl` is included in this package
as this file:

```{r, eval = FALSE}
system.file("perl", "samtofragmentbed.pl", package = "scPloidy")
```

The file `fragments.txt.gz` can be processed with the `fragmentoverlapcount` function,
specifying the option `Tn5offset = c(4, -5)`