---
title: "Ensembl BioMart Examples"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Ensembl BioMart Examples}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
options(width = 750)
knitr::opts_chunk$set(
  comment = "#>",
  error = FALSE,
  tidy = FALSE)
```

## Use Case #1: Functional Annotation of Genes Sharing a Common Evolutionary History

Evolutionary Transcriptomics aims to predict stages or periods of evolutionary conservation in
biological processes on the transcriptome level. However, finding genes sharing a [common evolutionary history](https://drostlab.github.io/myTAI/articles/Enrichment.html) could reveal how the the biological process might have evolved in the first place.

In this `Use Case` we will combine functional and biological annotation obtained with `biomartr` with enriched genes obtained with [PlotEnrichment()](https://drostlab.github.io/myTAI/articles/Enrichment.html). 

### Step 1

For the following example we will use the dataset an enrichment analyses found in [PlotEnrichment()](https://drostlab.github.io/myTAI/articles/Enrichment.html).

Install and load the [myTAI](https://github.com/drostlab/myTAI) package:

```{r, eval=FALSE}
# install myTAI
install.packages("myTAI")

# load myTAI
library(myTAI)
```

Download the `Phylostratigraphic Map` of _D. rerio_:

```r
# download the Phylostratigraphic Map of Danio rerio
# from Sestak and Domazet-Loso, 2015
```

The dataset comes from `Supplementary file 3` of this publication: https://academic.oup.com/mbe/article/32/2/299/1058654#77837069

After downloading `Supplementary file 3`, you will find the file `TableS3-2.xlsx` which can be used in the following `biomartr` functions.



Read the `*.xlsx` file storing the `Phylostratigraphic Map` of _D. rerio_ and format it for the use with `myTAI`:

```r
# install the readxl package
install.packages("readxl")

# load package readxl
library(readxl)

# read the excel file
DrerioPhyloMap.MBEa <- read_excel("TableS3-2.xlsx", sheet = 1, skip = 4)

# format Phylostratigraphic Map for use with myTAI
Drerio.PhyloMap <- DrerioPhyloMap.MBEa[ , 1:2]

# have a look at the final format
head(Drerio.PhyloMap)
```

```
  Phylostrata            ZFIN_ID
1           1 ZDB-GENE-000208-13
2           1 ZDB-GENE-000208-17
3           1 ZDB-GENE-000208-18
4           1 ZDB-GENE-000208-23
5           1  ZDB-GENE-000209-3
6           1  ZDB-GENE-000209-4
```

Now, `Drerio.PhyloMap` stores the `Phylostratigraphic Map` of _D. rerio_ which is used
as background set to perform enrichment analyses with `PlotEnrichment()` from `myTAI`.

### Enrichment Analyses

Now, the `PlotEnrichment()` function visualizes the over- and underrepresented `Phylostrata` of brain specific genes when compared with the total number of genes stored in the `Phylostratigraphic Map` of _D. rerio_.


```{r,eval=FALSE}
library(readxl)

# read expression data (organ specific genes) from Sestak and Domazet-Loso, 2015
Drerio.OrganSpecificExpression <- read_excel("TableS3-2.xlsx", sheet = 2, skip = 3)

# select only brain specific genes
Drerio.Brain.Genes <- unlist(unique(na.omit(Drerio.OrganSpecificExpression[ , "brain"])))

# visualize enriched Phylostrata of genes annotated as brain specific
PlotEnrichment(Drerio.PhyloMap,
               test.set     = Drerio.Brain.Genes,
               measure      = "foldchange",
               use.only.map = TRUE,
               legendName   = "PS")
```

Users will observe that for example brain genes deriving from PS5 are significantly enriched. 


Now we can select all brain genes originating in PS5 using the `SelectGeneSet()` function from `myTAI`. Please notice that `SelectGeneSet()` can only be used with phylostratigraphic maps only (`use.map.only = TRUE` argument) since myTAI version > 0.3.0.

```{r,eval=FALSE}
BrainGenes <- SelectGeneSet(ExpressionSet = Drerio.PhyloMap,
                            gene.set      = Drerio.Brain.Genes,
                            use.only.map  = TRUE)

# select only brain genes originating in PS5
BrainGenes.PS5 <- BrainGenes[which(BrainGenes[ , "Phylostrata"] == 5), ]

# look at the results
head(BrainGenes.PS5)
```

```
      Phylostrata           ZFIN_ID
14851           5 ZDB-GENE-000210-6
14852           5 ZDB-GENE-000210-7
14853           5 ZDB-GENE-000328-4
14856           5 ZDB-GENE-000411-1
14857           5 ZDB-GENE-000427-4
14860           5 ZDB-GENE-000526-1
```

Now users can perform the `biomart()` function to obtain the functional annotation of brain genes originating in PS5.

For this purpose, first we need to find the filter name of the corresponding gene ids such as `ZDB-GENE-000210-6`.

```{r, eval=FALSE}
# find filter for zfin.org ids
organismFilters("Danio rerio", topic = "zfin_id")
```

```
                            name                           description             dataset
52                  with_zfin_id                       with ZFIN ID(s) drerio_gene_ensembl
53  with_zfin_id_transcript_name          with ZFIN transcript name(s) drerio_gene_ensembl
103                      zfin_id ZFIN ID(s) [e.g. ZDB-GENE-060825-136] drerio_gene_ensembl
274                 with_zfin_id                       with ZFIN ID(s)    drerio_gene_vega
286                      zfin_id ZFIN ID(s) [e.g. ZDB-GENE-121214-212]    drerio_gene_vega
366                 with_zfin_id                       with ZFIN ID(s) drerio_gene_ensembl
367 with_zfin_id_transcript_name          with ZFIN transcript name(s) drerio_gene_ensembl
417                      zfin_id ZFIN ID(s) [e.g. ZDB-GENE-060825-136] drerio_gene_ensembl
588                 with_zfin_id                       with ZFIN ID(s)    drerio_gene_vega
600                      zfin_id ZFIN ID(s) [e.g. ZDB-GENE-121214-212]    drerio_gene_vega
680                 with_zfin_id                       with ZFIN ID(s) drerio_gene_ensembl
681 with_zfin_id_transcript_name          with ZFIN transcript name(s) drerio_gene_ensembl
731                      zfin_id ZFIN ID(s) [e.g. ZDB-GENE-060825-136] drerio_gene_ensembl
902                 with_zfin_id                       with ZFIN ID(s)    drerio_gene_vega
914                      zfin_id ZFIN ID(s) [e.g. ZDB-GENE-121214-212]    drerio_gene_vega
                    mart
52  ENSEMBL_MART_ENSEMBL
53  ENSEMBL_MART_ENSEMBL
103 ENSEMBL_MART_ENSEMBL
274 ENSEMBL_MART_ENSEMBL
286 ENSEMBL_MART_ENSEMBL
366 ENSEMBL_MART_ENSEMBL
367 ENSEMBL_MART_ENSEMBL
417 ENSEMBL_MART_ENSEMBL
588 ENSEMBL_MART_ENSEMBL
600 ENSEMBL_MART_ENSEMBL
680 ENSEMBL_MART_ENSEMBL
681 ENSEMBL_MART_ENSEMBL
731 ENSEMBL_MART_ENSEMBL
902 ENSEMBL_MART_ENSEMBL
914 ENSEMBL_MART_ENSEMBL
```

Now users can retrieve the corresponding GO attribute of _D. rerio_ with `organismAttributes`.

```{r,eval=FALSE}
# find go attribute term for D. rerio
organismAttributes("Danio rerio", topic = "go")
```

```
                                              name                             description
33                                           go_id                       GO Term Accession
36                                 go_linkage_type                   GO Term Evidence Code
38                            goslim_goa_accession                 GOSlim GOA Accession(s)
39                          goslim_goa_description                  GOSlim GOA Description
516                  ggorilla_homolog_ensembl_gene                 Gorilla Ensembl Gene ID
517  ggorilla_homolog_canomical_transcript_protein      Canonical Protein or Transcript ID
518               ggorilla_homolog_ensembl_peptide              Gorilla Ensembl Protein ID
519                    ggorilla_homolog_chromosome                 Gorilla Chromosome Name
520                   ggorilla_homolog_chrom_start           Gorilla Chromosome Start (bp)
521                     ggorilla_homolog_chrom_end             Gorilla Chromosome End (bp)
522                ggorilla_homolog_orthology_type                           Homology Type
523                       ggorilla_homolog_subtype                                Ancestor
524          ggorilla_homolog_orthology_confidence    Orthology confidence [0 low, 1 high]
525                       ggorilla_homolog_perc_id   % Identity with respect to query gene
526                    ggorilla_homolog_perc_id_r1 % Identity with respect to Gorilla gene
527                            ggorilla_homolog_dn                                      dN
528                            ggorilla_homolog_ds                                      dS
1240                                         go_id                                   GO ID
1241                                      quick_go                             Quick GO ID
1370                                         go_id                       GO Term Accession
1373                               go_linkage_type                   GO Term Evidence Code
1375                          goslim_goa_accession                 GOSlim GOA Accession(s)
1376                        goslim_goa_description                  GOSlim GOA Description
1853                 ggorilla_homolog_ensembl_gene                 Gorilla Ensembl Gene ID
1854 ggorilla_homolog_canomical_transcript_protein      Canonical Protein or Transcript ID
1855              ggorilla_homolog_ensembl_peptide              Gorilla Ensembl Protein ID
1856                   ggorilla_homolog_chromosome                 Gorilla Chromosome Name
1857                  ggorilla_homolog_chrom_start           Gorilla Chromosome Start (bp)
1858                    ggorilla_homolog_chrom_end             Gorilla Chromosome End (bp)
1859               ggorilla_homolog_orthology_type                           Homology Type
1860                      ggorilla_homolog_subtype                                Ancestor
1861         ggorilla_homolog_orthology_confidence    Orthology confidence [0 low, 1 high]
1862                      ggorilla_homolog_perc_id   % Identity with respect to query gene
1863                   ggorilla_homolog_perc_id_r1 % Identity with respect to Gorilla gene
1864                           ggorilla_homolog_dn                                      dN
1865                           ggorilla_homolog_ds                                      dS
2577                                         go_id                                   GO ID
2578                                      quick_go                             Quick GO ID
2707                                         go_id                       GO Term Accession
2710                               go_linkage_type                   GO Term Evidence Code
2712                          goslim_goa_accession                 GOSlim GOA Accession(s)
2713                        goslim_goa_description                  GOSlim GOA Description
3190                 ggorilla_homolog_ensembl_gene                 Gorilla Ensembl Gene ID
3191 ggorilla_homolog_canomical_transcript_protein      Canonical Protein or Transcript ID
3192              ggorilla_homolog_ensembl_peptide              Gorilla Ensembl Protein ID
3193                   ggorilla_homolog_chromosome                 Gorilla Chromosome Name
3194                  ggorilla_homolog_chrom_start           Gorilla Chromosome Start (bp)
3195                    ggorilla_homolog_chrom_end             Gorilla Chromosome End (bp)
3196               ggorilla_homolog_orthology_type                           Homology Type
3197                      ggorilla_homolog_subtype                                Ancestor
3198         ggorilla_homolog_orthology_confidence    Orthology confidence [0 low, 1 high]
3199                      ggorilla_homolog_perc_id   % Identity with respect to query gene
3200                   ggorilla_homolog_perc_id_r1 % Identity with respect to Gorilla gene
3201                           ggorilla_homolog_dn                                      dN
3202                           ggorilla_homolog_ds                                      dS
3914                                         go_id                                   GO ID
3915                                      quick_go                             Quick GO ID
                 dataset                 mart
33   drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
36   drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
38   drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
39   drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
516  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
517  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
518  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
519  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
520  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
521  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
522  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
523  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
524  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
525  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
526  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
527  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
528  drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1240    drerio_gene_vega ENSEMBL_MART_ENSEMBL
1241    drerio_gene_vega ENSEMBL_MART_ENSEMBL
1370 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1373 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1375 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1376 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1853 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1854 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1855 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1856 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1857 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1858 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1859 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1860 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1861 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1862 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1863 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1864 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
1865 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
2577    drerio_gene_vega ENSEMBL_MART_ENSEMBL
2578    drerio_gene_vega ENSEMBL_MART_ENSEMBL
2707 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
2710 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
2712 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
2713 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3190 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3191 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3192 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3193 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3194 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3195 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3196 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3197 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3198 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3199 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3200 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3201 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3202 drerio_gene_ensembl ENSEMBL_MART_ENSEMBL
3914    drerio_gene_vega ENSEMBL_MART_ENSEMBL
3915    drerio_gene_vega ENSEMBL_MART_ENSEMBL
```

Now users can specify the filter `zfin_id` and attribute `go_id` to retrieve the GO terms of corresponding gene ids (__Please note that this will take some time__).

```{r, eval=FALSE}
# retrieve GO terms of D. rerio brain genes originating in PS5
GO_tbl.BrainGenes <- biomart(genes      = unlist(BrainGenes.PS5[ , "ZFIN_ID"]),
                             mart       = "ENSEMBL_MART_ENSEMBL",
                             dataset    = "drerio_gene_ensembl",
                             attributes = "go_id",
                             filters    = "zfin_id")

head(GO_tbl.BrainGenes)
```

```
            zfin_id      go_id
1 ZDB-GENE-000210-6 GO:0060037
2 ZDB-GENE-000210-6 GO:0046983
3 ZDB-GENE-000210-7 GO:0046983
4 ZDB-GENE-000328-4 GO:0007275
5 ZDB-GENE-000328-4 GO:0007166
6 ZDB-GENE-000328-4 GO:0035567
```





