---
title: "Genomics"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options:
  markdown:
    wrap: 80
    canonical: true
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  include = TRUE
)
```

# Introduction

This vignette demonstrates how SIMplyBee manages and manipulates the honey bee's
genomic information. Specifically, it describes:

-   how to obtain the genomic information,
-   how to pool genotypes, and
-   how to compute genomic relationship matrices.

Let's first create a colony.

```{r load package}
library(package = "SIMplyBee")
founderGenomes <- quickHaplo(nInd = 2, nChr = 3, segSites = 100)
SP <- SimParamBee$new(founderGenomes)
SP$setTrackRec(TRUE) # request recombination tracking

baseQueens <- createVirginQueens(founderGenomes)
baseDrones <- createDrones(x = baseQueens[1], nInd = 15)

colony <- createColony(x = baseQueens[2])
colony <- cross(colony, drones = baseDrones, checkCross = "warning")
colony <- buildUp(colony)
```

# Obtaining genomic information

Honeybees have a haplo-diploid inheritance system where queens and workers are
diploid and drones are haploid. In SIMplyBee, we simulate drones as
doubled-haploids, that is, as fully homozygous diploid individuals. This means
that they have two identical sets of chromosomes. When they produce sperm, their
gametes all have the same one set of chromosomes. Despite them being diploid, we
generally return a haploid set of chromosomes from drones, unless specifically
requested that you want the doubled-haploid genotype.

Following AlphaSimR, SIMplybee has a group of genome retrieval functions
`get*Haplo/Geno()` which extract haplotypes and genotypes for all segregating
sites (`SegSites`), quantitative trait loci (`QTL`), markers (`SNP`), and the
identical by descent (`IBD`) haplotypes. Here, site, locus and marker are all
synonyms for a position in the genome. These functions leverage AlphaSimR
functionality, but work with SIMplyBee's `Colony` or `MultiColony` objects and
in addition take the `caste` argument to extract information for a specific
caste. Another argument you can use with this function is
`collapse = TRUE/FALSE`. If `collapse = TRUE` then all of the information is
collapsed together and a single matrix is returned, if `collapse = FALSE` we
return a list by caste or by colony.

We recommend that you study the index of available `get*()` functions in
SIMplyBee and read this vignette for a short demonstration: `help(SIMplyBee)`.

To show all this functionality, let's get haplotypes and genotypes across the
segregating sites for the different castes using `getSegSitesGeno()` or
`getSegSitesHaplo()`. The first row of the output shows marker identifications
(chromosome_locus) and the first column shows haplotype identifications
(individual_haplotype). The alleles are represented with a sequence of 0's and
1's. Let's first obtain the information at the segregating sites for the queen
(we limit the output to the first 10 sites):

```{r queens haplo}
getSegSiteHaplo(colony, caste = "queen")[, 1:10]
```

```{r queens geno}
getSegSiteGeno(colony, caste = "queen")[, 1:10]
```

Now for the fathers:

```{r fathers haplo}
getSegSiteHaplo(colony, caste = "fathers")[, 1:10]
```

```{r fathers geno}
getSegSiteGeno(colony, caste = "fathers")[, 1:10]
```

Since father are drones, and these are haploid, we get one row per father. We
can retrieve the doublet-haploid (diploid implementation) state, if this is
desired (showing just one father to show this clearly):

```{r fathers haplo 2}
getSegSiteHaplo(colony, caste = "fathers", 
                nInd = 1, dronesHaploid = FALSE)[, 1:10]
```

```{r fathers geno 2}
getSegSiteGeno(colony, caste = "fathers", 
               nInd = 1, dronesHaploid = FALSE)[, 1:10, drop = FALSE]
```

Now two workers:

```{r workers haplo}
getSegSiteHaplo(colony, caste = "workers", nInd = 2)[, 1:10]
```

```{r workers geno}
getSegSiteGeno(colony, caste = "workers", nInd = 2)[, 1:10]
```

And finally four drones:

```{r drones haplo}
getSegSiteHaplo(colony, caste = "drones", nInd = 4)[, 1:10]
```

```{r drones geno}
getSegSiteGeno(colony, caste = "drones", nInd = 4)[, 1:10]
```

You can also use `caste = "all"` to get the haplotypes and phenotypes from every
individual in the colony. If the argument `collapse` is set to `FALSE`, then the
function returns a list with haplotypes for each caste. Let's explore the
structure of the output:

```{r Colony haplo}
str(getSegSiteHaplo(colony, caste = "all", collapse = FALSE))
```

If the argument `collapse` is set to `TRUE`, the function returns a single
matrix with haplotypes of all the individuals. The same behaviour is implemented
for all the functions that extract genomic information

```{r}
str(getSegSiteHaplo(colony, caste = "all", collapse = TRUE))
```

```{r}
getSegSiteHaplo(colony, caste = "all", collapse = TRUE)[1:10, 1:10]
```

```{r all geno}
getSegSiteGeno(colony, caste = "all", collapse = TRUE)[1:10, 1:10]
```

SIMplyBee also has shortcuts for these haplotype and genotype functions to make
life a bit easier for the user:

-   `getQueenSegSitesHaplo()`

-   `getQueenSegSitesGeno()`

-   `getFathersSegSitesHaplo()`

-   `getFathersSegSitesGeno()`

-   `getWorkersSegSitesHaplo()`

-   `getWorkersSegSitesGeno()`

-   `getDronesSegSitesHaplo()`

-   `getDronesSegSitesGeno()`

-   `getVirginQueensSegSitesHaplo()`

-   `getVriginQueensSegSitesGeno()`

Similar aliases exist also for extracting information about quantitative trait
loci (`QTL`), markers (`SNP`), and the identical by descent (`IBD`) haplotypes.

# Pooling genotypic information

Unfortunately, in real life it's challenging to get the genotype of every
individual honeybee and so SIMplyBee provides the function `getPooledGeno()` to
imitate real life data. `getPooledGeno()` returns a pooled genotype from
individual genotypes to mimic the genotyping of a pool of colony members. A
comparison of pooled and individual genotypes also allows the user to compare
the two and see the impact of pooled samples on results.

Firstly let's obtain the genotypes of the workers and of the queen so that
they're easier to work with:

```{r assign genotypes of drones and queens }
genoQ <- getSegSiteGeno(colony, caste = "queen")
genoW <- getSegSiteGeno(colony, caste = "workers")
```

The function `getPooledGeno()` required also the sex of individuals whose
genotype are getting pooled (`F` for females and `M` for males).

```{r get drones sex}
sexW <- getCasteSex(colony, caste = "workers")
```

You have two options when choosing what kind of pooled genotypes you would like,
using the `type =` argument. You can use `type = "mean"` for the average
genotypes and `type = "count"` for the counts of reference and alternative
alleles.

```{r pooled geno count}
getPooledGeno(x = genoW, type = "count", sex = sexW)[, 1:10]
```

```{r pooled geno mean}
(poolW <- getPooledGeno(x = genoW, type = "mean", sex = sexW))[, 1:10]
```

Now lets plot and compare the pooled workers to the queen's genotype (note the
use of jitter for queen's genotype on the x-axis so we can spread out the dots
in the plot!).

```{r plot genoQ with poolW}
plot(y = poolW, x = jitter(genoQ), ylim = c(0, 2), xlim = c(0, 2),
     ylab = "Average allele dosage in workers",
     xlab = "Allele dosage in the queen" )
```

# Computing Genomic Relationship Matrices

This section introduces the calculations of IBD and IBS genomic relationship
matrices, so let's have a quick reminder of what these mean. Identity-by-state
(IBS) is a term used when two alleles, two segments or sequences of the genome
are identical. Identity-by-descent (IBD) is when a segment of matching (IBS) DNA
shared by two or more individuals has been inherited from a common ancestor.

Using IBD and IBS can allow a user to look into the relationships based on the
genomic data. We'll demonstrate this by calculating some Genomic Relationship
Matrices (GRM) using SIMplyBee's `calcBeeGRMIbs()` and `calcBeeGRMIbd()`.

Let's look at the `calcBeeGRMIbs()` first. This function returns a Genomic
Relatedness Matrix (GRM) for honeybees from IBS genomic data (bi-allelic SNP
represented as allele dosages) following the method for the sex X chromosome
(Druet and Legarra, 2020).

To see this, let's obtain the genotypes and sex information of all individuals
in the colony.

```{r genotypes}
geno <- getSegSiteGeno(colony, collapse = TRUE)
sex <- getCasteSex(x = colony, collapse = TRUE)
```

Now let's calculate the IBS GRM, we will use the genotypes to calculate this:

```{r calcBeeGRMIbs()}
GRM <- calcBeeGRMIbs(x = geno, sex = sex)
```

This produces a matrix that we can plot and summarise - its useful to summarise
diagonal and off-diagonal values separately.

```{r view diagonal}
library("Matrix")
image(as(GRM, "Matrix"))

x <- diag(GRM)
hist(x)
summary(x)
```

```{r view non-diagonal}
x <- GRM[lower.tri(x = GRM, diag = FALSE)]
hist(x)
summary(x)
```

We can also inspect GRM elements between specific caste members:

```{r compare caste memebers}
ids <- getCasteId(colony) 
idQueen <- ids$queen
idFathers <- ids$fathers
idWorkers <- ids$workers
idDrones <- ids$drones
idVirginQueens <- ids$virginQueens
mw <- "mw"
md <- "md"
```

```{r Queen vs fathers 1}
r <- range(GRM)
hist(GRM[idQueen, idFathers], xlim = r)
```

```{r Queen vs workers 1}
hist(GRM[idQueen, idWorkers], xlim = r)
```

```{r Queen vs drones 1}
hist(GRM[idQueen, idDrones], xlim = r)
```

`calcBeeGRMIbs()` uses the `calcBeeAlleleFreq()` function to calculate allele
frequencies for centering the honeybee genotypes. You can also use it in some
other cases:

```{r  alleleFreq}
hist(alleleFreq <- calcBeeAlleleFreq(x = geno, sex = sex))
```

Now lets look at `calcBeeGRMIbd()`. This function creates Genomic Relatedness
Matrix (GRM) for honeybees based on Identical-By-Descent (IBD) information. It
returns a list with a matrix of gametic relatedness coefficients (between
genomes) and a matrix of individual relatedness coefficients (between
individuals). Please refer to Grossman and Eisen (1989), Fernando and Grossman
(1989), Fernando and Grossman (1990), Van Arendonk, Tier, and Kinghorn (1994),
and Hill and Weir (2011) for the background on this function.

Now obtain the IBD haplotypes and compute IBD GRM.

```{r Set up haplotypes }
haploQ <- getQueenIbdHaplo(colony)
haploF <- getFathersIbdHaplo(colony)
haploW <- getWorkersIbdHaplo(colony)
haploD <- getDronesIbdHaplo(colony)
haploV <- getVirginQueensIbdHaplo(colony)
 
haplo <- rbind(haploQ, haploF, haploW, haploD, haploV)
```

```{r calcGRMIbd}
GRMs <- calcBeeGRMIbd(x = haplo)
```

Let's view this matrix:

```{r  view calcGRMIbd}
image(as(GRMs$genome, "Matrix"))
image(as(GRMs$indiv, "Matrix"))
```

Now we can look at the diagonal of the obtained matrices that represent 1 for a
genomes and 1 + inbreeding coefficient individuals.

```{r  view diagonal1}
i <- diag(GRMs$genome)
summary(x)
 
i <- diag(GRMs$indiv)
summary(i)
```

And now the non-diagonals that represent the coefficients of relationship
between genomes or between individuals.

```{r  view non-diagonal1}
x <- GRMs$genome[lower.tri(x = GRMs$genome, diag = FALSE)]
hist(x)
summary(x)
  
i <- GRMs$indiv[lower.tri(x = GRMs$indiv, diag = FALSE)]
hist(i)
summary(i)
```

Let's now compare compare relationships between caste members within a colony.

```{r}
# Obtains caste member IDs
qI <- getQueen(colony)@id
fI <- sort(getFathers(colony)@id)
wI <- sort(getWorkers(colony)@id)
dI <- sort(getDrones(colony)@id)
r <- range(GRMs$indiv)
```

Compare queen and fathers:

```{r  Queen vs fathers}
hist(GRMs$indiv[fI, qI], xlim = r)
```

Queen and workers:

```{r  Queen vs workers}
hist(GRMs$indiv[wI, qI], xlim = r)
```

Queen and drones:

```{r  Queen vs drones}
hist(GRMs$indiv[dI, qI], xlim = r)
```

# References

Druet and Legarra (2020) Theoretical and empirical comparisons of expected and
realized relationships for the X-chromosome. Genetics Selection Evolution,
52:50. <https://doi.org/10.1186/s12711-020-00570-6>

Grossman and Eisen (1989) Inbreeding, coancestry, and covariance between
relatives for X-chromosomal loci. The Journal of Heredity, 80(2):137--142.
<https://doi.org/10.1093/oxfordjournals.jhered.a110812>

Fernando and Grossman (1989) Covariance between relatives for X-chromosomal loci
in a population in disequilibrium. Theoretical and Applied Genetics,
77:311--319. <https://doi.org/10.1007/bf00305821>

Fernando and Grossman (1990) Genetic evaluation with autosomal and X-chromosomal
inheritance. Theoretical and Applied Genetics, 80:75--80.
<https://doi.org/10.1007/bf00224018>

Van Arendonk, Tier, and Kinghorn (1994) Use of multiple genetic markers in
prediction of breeding values. Genetics, 137(1):319--329.
<https://doi.org/10.1093/genetics/137.1.319>

Hill and Weir (2011) Variation in actual relationship as a consequence of
Mendelian sampling and linkage. Genetics Research, 93(1):47--64.
<https://doi.org/10.1017/s0016672310000480>
