---
title: "EZbakR for people in a hurry"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quickstart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
abstract: >
  EZbakR is a one-stop-shop for analyses of nucleotide recoding RNA-seq datasets
  (NR-seq). NR-seq refers to a class of methods (e.g., TimeLapse-seq, SLAM-seq,
  TUC-seq, etc.) that combine RNA-seq, metabolic labeling, and unique metabolic
  label recoding chemistries. These methods were originally developed to dissect
  the kinetics of RNA synthesis and degradation. Excitingly though, a treasure
  trove of extensions of the original methods have been created over the years.
  To date, nucleotide recoding has been combined with TT-seq, Start-seq,
  Ribo-seq, scRNA-seq, Perturb-seq, and subcellular fractionation, just to name
  a few such extensions. In addition, while the original methods used
  4-thiouridine (s^4^U), the same chemistry has been found to work with
  6-thioguanosine (s^6^G), opening the door to dual-labeling experimental
  designs (e.g., TILAC).
  
  EZbakR was designed to facilitate analyses of the full gamut of NR-seq
  methods. While not all analyses implemented in EZbakR can be applied to all
  variants of NR-seq, EZbakR's modular design means that it has something for
  everyone. In this vignette, we will quickly highlight the main modules of
  EZbakR. For simplicity, we will be focusing on how to perform a standard, 
  single-label NR-seq analysis. Other vignettes go into greater detail about
  each module and the broad array of analyses that can be performed by each.
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.dim = c(6, 4)
)
```

## Necessary Setup

You will need to install and load EZbakR. Instructions to do so can be found at
this [link](https://github.com/isaacvock/EZbakR). Setting the seed will also
ensure exact reproducibility of analyses of simulated data shown in this vignette.
Finally, I will use a little bit of dplyr to showcase some key results:

```{r setup}
library(EZbakR)
library(dplyr)
set.seed(42)
```

## Step 0: Create an EZbakRData Object

To use EZbakR, you will need to create an `EZbakRData` object. An `EZbakRData` 
object consists of two components: a cB data frame and a metadf data frame. cB
stands for "counts binomial" and contains all of the information about mutations 
seen in sequencing reads in each sample. metadf stands for "metadata
data frame" and contains information about the experimental details of each
sample (e.g., how long the metabolic label feed was). Examples of what these
data structures look like are lazily loaded when you load `EZbakR`. This means
that you can call them as if you had created them yourself, with `example_cB`
and `example_metadf`:

```{r, eval = FALSE}
# Check out examples:
example_cB
example_metadf
```

A cB data frame consists of rows corresponding to groups of reads with identical
data, where "data" refers to the:

1. Sample of origin. This should be noted in the **sample** column of the cB.
2. Mutational content. This should be noted by one or more columns of the form
<reference base><sequenced base> (e.g., **TC** in `example_cB`). The value in 
these columns should correspond to the observed number of mutations.
3. Mutable nucleotide content. This should be noted by one or more columns of the
form n<reference base> (e.g., **nT** in `example_cB`). The value in these columns
should correspond to the number of mutable bases overlapped by the read. 
4. Genomic features the read was assigned to. In EZbakR, these columns can be
named anything. fastq2EZbakR, the pipeline most commonly paired with EZbakR,
can produce feature columns such as those that show up in `example_cB`:
    * **GF**: Stands for "gene feature" and represents the gene (exonic or 
    intronic regions) the read overlapped
    * **XF**: Stands for "exonic feature" and represents the gene (exonic 
    regions only) the read overlapped
    * **rname**: Chromosome the read came from.
    
The final column of the cB data frame is **n**, the number of reads with 
identical data for the other columns. cB data frames are most easily obtained 
from the Snakemake pipeline fastq2EZbakR, available [here](https://github.com/isaacvock/fastq2EZbakR). 
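
To make the structure concrete, here is a toy cB built by hand (values are made
up purely for illustration; real cBs should come from a pipeline like
fastq2EZbakR):

```{r, eval = FALSE}
# Illustrative toy cB: each row is a group of reads with identical
# sample, mutation count, mutable base count, and feature assignment.
toy_cB <- dplyr::tibble(
  sample = c("sampleA", "sampleA", "sampleB"),
  TC     = c(0, 3, 1),       # T-to-C mutations observed in these reads
  nT     = c(20, 25, 18),    # Ts overlapped by these reads
  XF     = c("gene1", "gene1", "gene2"),
  n      = c(150, 12, 87)    # number of reads with this exact combination
)
```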

A metadf table has the following required columns:

1. **sample**: Sample name, as it shows up in the cB table.
2. Metabolic label time information. If you have a single label, this should
be documented in a column called **tl**. If you used multiple labels, then
the different label times should be noted in columns of the form **tl_<muttype>**,
where <muttype> denotes the type of mutation that conversion of the label
yields. For example, if you are using s^4^U (conversion yields T-to-C
mutations) and s^6^G (conversion yields G-to-A mutations), then your metadf
should include columns **tl_TC** and **tl_GA**.

The remaining columns can be named anything you want and should describe aspects
of each experiment. NOTE: these are currently interpreted exclusively as
discrete factors. Things like time-series modeling are not currently supported,
but may be in future releases. These factors will be used when deciding how to
group samples when averaging replicate data and performing comparative analyses.
More on this later!
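
Again for illustration, here is a toy metadf that could accompany a
single-label experiment, with one extra experimental detail column (all values
made up):

```{r, eval = FALSE}
# Illustrative toy metadf: tl is the label time (tl = 0 can be used to
# denote an unlabeled control sample); treatment is a discrete factor.
toy_metadf <- dplyr::tibble(
  sample    = c("sampleA", "sampleB", "sampleC"),
  tl        = c(4, 4, 0),
  treatment = c("WT", "KO", "WT")
)
```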

Once you have a cB and metadf table, you can create an EZbakRData object:

```{r}
ezbdo <- EZbakRData(example_cB, example_metadf)
```

Now, let's generate some simulated data to be used in the remaining steps:

```{r}
# Simulate 500 "genes" worth of data
  # Simulates 2 replicates of 2 different experimental conditions
simdata <- EZSimulate(nfeatures = 500, nreps = 2)

ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
```


## Step 1: EstimateFractions()

The first step of any NR-seq analysis is to figure out what fraction of your
reads come from each expected mutational population. For example, in a standard
s^4^U labeling NR-seq experiment, you will have reads from a high T-to-C mutation
population and reads from a low T-to-C mutation population. Technically, this
involves two steps: 1) estimating mutation rates of the two populations in each
labeled sample and 2) estimating fractions. Both of these steps are performed
with a single function in EZbakR, `EstimateFractions()`:

```{r}
ezbdo <- EstimateFractions(ezbdo, features = "feature")

### NOTE:
### For real data from fastq2EZbakR, you will likely
### want to set `features = "XF"`. See EstimateFractions()
### vignette for details
```
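
In essence, `EstimateFractions()` fits a two-component mixture model. Sketching
the idea for the standard single-label case (see the vignette linked below for
the real details), the probability of observing $y$ T-to-C mutations in a read
overlapping $n$ Ts is modeled as:

$$
P(y \mid n) = \theta \, \mathrm{Binom}(y; n, p_{\mathrm{high}}) + (1 - \theta) \, \mathrm{Binom}(y; n, p_{\mathrm{low}})
$$

where $\theta$ is the fraction of reads from the high-mutation-rate population,
and $p_{\mathrm{high}}$ and $p_{\mathrm{low}}$ are the mutation rates of the
two populations.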

See `?EstimateFractions()` as well as the
[associated vignette](https://isaacvock.github.io/EZbakR/articles/EstimateFractions.html)
for details and descriptions of optional parameters. We
can use the simulated ground truth to assess estimate accuracy and confirm
that the method is working:

```{r, fig.align='center'}
# Simulated ground truth
truth <- simdata$PerRepTruth

# Get fraction estimates. 
  # Can also access with ezbdo$fractions[[1]] or ezbdo$fractions$feature
est <- EZget(ezbdo, type = 'fractions')

# Combine
compare <- dplyr::inner_join(est, truth, by = c('sample', 'feature'))

# Assess accuracy
plot(compare$true_fraction_highTC, compare$fraction_highTC)
abline(0,1)
```
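
As a quick numeric sanity check, we can also compute the correlation between
the true and estimated fractions:

```{r}
cor(compare$true_fraction_highTC, compare$fraction_highTC)
```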

Looks good!

## Step 2: EstimateKinetics()

The next step of a standard NR-seq analysis is to use the estimates of mutational
population proportions (and possibly normalized read counts) to estimate kinetic
parameters of interest (e.g., degradation and synthesis rate constants). The 
simplest and quickest way to do this is with `EstimateKinetics()`:

```{r}
ezbdo <- EstimateKinetics(ezbdo)
```

By default, this uses a standard, steady-state, single rate constant model of
RNA metabolism to infer a degradation rate constant (`kdeg`) and synthesis rate
constant (`ksyn`) for each gene. Other analysis strategies implemented in 
`EstimateKinetics()` include those more appropriate for non-steady-state systems
or short label times. The `strategy` argument in `EstimateKinetics()` sets which
analysis strategy is used. See `?EstimateKinetics` for details.
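
To build intuition for the default strategy: at steady-state, the fraction of
reads from newly synthesized RNA after a label time $t_l$ follows
$\theta = 1 - e^{-k_{\mathrm{deg}} t_l}$. Inverting this gives the core
estimators (a sketch of the idea; EZbakR's actual implementation handles
normalization and uncertainty propagation):

$$
k_{\mathrm{deg}} = -\frac{\log(1 - \theta)}{t_l} \qquad k_{\mathrm{syn}} = k_{\mathrm{deg}} \cdot [\mathrm{RNA}]
$$

where $[\mathrm{RNA}]$ denotes a normalized estimate of a feature's RNA level.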

Again, let's compare to the simulated ground truth:

```{r, fig.align='center'}
# Simulated ground truth
truth <- simdata$PerRepTruth

# Get kinetic parameter estimates.
  # Can also access with ezbdo$kinetics[[1]]
est <- EZget(ezbdo, type = 'kinetics')

# Combine
compare <- dplyr::inner_join(est, truth, by = c('sample', 'feature'))

# Assess accuracy
plot(log(compare$true_kdeg), compare$log_kdeg)
abline(0,1)
```

Great!

## Step 3: AverageAndRegularize()

Typically, you will have multiple replicates of each experimental condition. In
that case, the next step is to average the data across these replicates and
estimate overall uncertainties. This can then be passed along to downstream
comparative analyses to see which features have differences in their kinetic
parameters between different experimental conditions. This averaging process
is done with `AverageAndRegularize()`:

```{r}
ezbdo <- AverageAndRegularize(ezbdo,
                              parameter = "log_kdeg")
```

See `?AverageAndRegularize()` for details. One cool thing about
`AverageAndRegularize()` is that it implements a heteroskedastic linear model
and can thus be passed formula objects specifying how you would like to
stratify samples and whether there are any batch effects you would like to
regress out of the main parameter effects. See the relevant vignette for more
details.
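
For example, if your metadf had an additional `batch` column alongside
`treatment`, you might fit something like the sketch below. NOTE: the
`formula_mean` argument name and the `batch` column are assumptions for
illustration; check `?AverageAndRegularize` for the exact interface:

```{r, eval = FALSE}
# Hypothetical sketch: average log(kdeg) within treatment groups while
# regressing a batch effect out of the mean estimates. `formula_mean`
# and the `batch` column are assumptions, not part of the simulated data.
ezbdo <- AverageAndRegularize(ezbdo,
                              parameter = "log_kdeg",
                              formula_mean = ~ treatment + batch)
```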

Finally, setting `parameter = "log_kdeg"` was not technically necessary, as it
is the default, but it highlights that you can choose which kinetic parameter
to average across replicates. If you run this again with a different
parameter, a new table will be added to the `ezbdo$averages` list:

```{r}
ezbdo <- AverageAndRegularize(ezbdo,
                              parameter = "log_ksyn")
```

Let's make sure the estimates are still accurate:

```{r, fig.align='center'}
truth <- simdata$AvgTruth

est <- EZget(ezbdo, type = 'averages', parameter = 'log_kdeg')

compare <- dplyr::inner_join(truth, est, by = 'feature')

plot(compare$true_logkdeg_treatmenttreatment1,
     compare$mean_treatmenttreatment1)
abline(0,1)

```

All good!

## Step 4: CompareParameters()

Once you have replicate averaged parameter estimates, you can compare them and
assess the "statistical significance" of any observed differences. This is done
with `CompareParameters()`:

```{r}
ezbdo <- CompareParameters(ezbdo,
                           condition = 'treatment',
                           reference = 'treatment1',
                           experimental = 'treatment2',
                           parameter = "log_kdeg")
```

In addition to providing the `EZbakRData` object output by
`AverageAndRegularize()`, you also need to specify three other arguments:

1. `condition`: the metadf column (or function of columns) that replicates
were stratified by. The simulated metadf has a single experimental detail
column called `treatment`, so this is the condition here.
2. `reference`: what value of `condition` do you want to consider the 
"reference" level.
3. `experimental`: what value of `condition` do you want to consider the 
"experimental" level.

Once again, `parameter`'s default value is "log_kdeg", but it is specified here
to be explicit. `CompareParameters()` will calculate the difference in average
parameter values as `experimental` - `reference`.
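
As with earlier steps, the results are stored in the `EZbakRData` object and
can be retrieved with `EZget()`. Here is a sketch of how you might tabulate
significant hits with a little dplyr (the `padj` and `difference` column names
are what I expect in the comparisons table; inspect the output to confirm):

```{r, eval = FALSE}
# Sketch: pull the comparisons table and rank significant features.
# Column names (difference, padj) are assumptions; check the table first.
comparisons <- EZget(ezbdo, type = "comparisons")

comparisons %>%
  dplyr::filter(padj < 0.05) %>%
  dplyr::arrange(dplyr::desc(abs(difference)))
```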

EZbakR provides some convenient visualization functions to inspect the output
of `CompareParameters()`.

Volcano plot:

```{r, fig.align='center'}
EZVolcanoPlot(ezbdo, parameter = "log_kdeg",
              condition = 'treatment',
              reference = 'treatment1',
              experimental = 'treatment2')
```

MA plot:

```{r, fig.align='center'}
EZMAPlot(ezbdo, parameter = "log_kdeg",
         condition = 'treatment',
         reference = 'treatment1',
         experimental = 'treatment2')
```


