---
title: "bumblebee: Estimate transmission flows within and between population groups accounting for sampling heterogeneity."
output: rmarkdown::html_vignette
author: "Lerato E. Magosi"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{bumblebee: Estimate transmission flows within and between population groups accounting for sampling heterogeneity.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE}
# Global options
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path="fig/"
)
options(tibble.print_min = 4L, tibble.print_max = 4L)
```

![](fig/bumblebee_transmission_flows_with_title_and_labels_img.png){width=95%}

## Background

To control the spread of infectious disease it is important to quantify the impact of
interventions and factors such as: age, sex, socio-economic status and
geographical location in shaping patterns of transmission.

The **_Bumblebee_** package uses counts of directed transmission pairs identified 
between samples from population groups of interest to estimate the flow of transmissions 
within and between those population groups accounting for sampling heterogeneity.

Counts of observed directed transmission pairs can be obtained from deep-sequence 
phylogenetic data (via [phyloscanner](https://github.com/BDI-pathogens/phyloscanner))
or known epidemiological contacts. 

**Note**: Deep-sequence data is also commonly referred to as high-throughput or 
next-generation sequence data.


### Example application areas include: 

1. Quantifying transmission patterns of HIV, the virus that causes AIDS, in the 
   context of HIV prevention initiatives such as universal test-and-treat. 
   
   **To learn more see:** Magosi LE, et al., Deep-sequence phylogenetics to 
   quantify patterns of HIV transmission in the context of a universal testing 
   and treatment trial – BCPP/ Ya Tsie trial. To submit for publication, 2021.
 
2. Quantifying transmission patterns of SARS-COV-2, the virus that causes COVID-19, 
   in the presence of heterogeneous vaccine uptake.


#### This vignette walks through the steps to estimate transmission flows and confidence intervals.
 
---

## Data:

```

We shall use the data of HIV transmissions within and between intervention and control
communities in the BCPP/Ya Tsie HIV prevention trial. 

The BCPP / Ya Tsie study was a pair-matched community-randomized trial involving
30 communities in Botswana to test the effect of a universal HIV test-and-treat 
intervention in efficiently reducing the occurrence of new HIV infections at the 
population level.

To learn more about the data: 

# Counts of directed HIV transmission pairs identified between samples from
# intervention and control communities.
?counts_hiv_transmission_pairs, 

# Estimated number of individuals with HIV in intervention and control
# communities and the number of individuals sampled from each.
?sampling_frequency  

# Estimated transmission flows or relative probability of transmission
# within and between population groups adjusted for variable sampling
# among the population groups. 
# Note: The `theta_hat` variable denotes estimated transmission flows.
?estimated_hiv_transmission_flows
 
```

#### The data was sourced from:

Magosi LE, et al., Deep-sequence phylogenetics to quantify patterns of 
HIV transmission in the context of a universal testing and treatment
trial – BCPP/ Ya Tsie trial. To submit for publication, 2021.


## A basic analysis: Estimating transmission flows within and between population groups

We shall use the `estimate_transmission_flows_and_ci()` function to estimate transmission 
flows and corresponding confidence intervals within and between intervention and control 
communities of the BCPP / Ya Tsie trial.  See `?estimate_transmission_flows_and_ci()` to
learn more about the function.

The `estimate_transmission_flows_and_ci()` function 
requires the following inputs for analysis:

* A character vector of population groups/strata (e.g. communities, age-groups, genders or trial arms) 
  between which to estimate transmission flows.
 
* A numeric vector indicating the number of individuals sampled per population group 

* A numeric vector of the estimated number of individuals per population group 

* A data.frame of counts of directed transmission pairs identified between samples 
  from population groups of interest.


```

# Load libraries  ------------------------------------------------

library(bumblebee) # for estimating transmission flows
library(dplyr)     # for manipulating data.frames


# Estimate transmission flows and confidence intervals  --------------------------

# We shall use the data of HIV transmissions within and between intervention and control
# communities in the BCPP/Ya Tsie HIV prevention trial. To learn more about the data 
# ?counts_hiv_transmission_pairs and ?sampling_frequency 

# View counts of observed directed HIV transmissions within and between 
# intervention and control communities (n = 82)
counts_hiv_transmission_pairs

# View the estimated number of individuals with HIV in intervention and control 
# communities and the number of individuals sampled from each
sampling_frequency

# Estimate transmission flows within and between intervention and control communities
# accounting for variable sampling among population groups. 

# Basic output

results_estimate_transmission_flows_and_ci <- estimate_transmission_flows_and_ci(
    group_in = sampling_frequency$population_group, 
	individuals_sampled_in = sampling_frequency$number_sampled, 
	individuals_population_in = sampling_frequency$number_population, 
	linkage_counts_in = counts_hiv_transmission_pairs)
 
# View results
results_estimate_transmission_flows_and_ci

# Retrieve dataset of estimated transmission flows 
dframe <- results_estimate_transmission_flows_and_ci$flows_dataset

```

## Interpretation of results:

The `theta_hat` variable denotes estimated proportions of HIV transmissions in 
the trial population within and between intervention and control communities.
There was substantial sexual mixing between intervention and control communities.
Transmissions into intervention communities from control communities were three
times more common than the reverse, compatible with a benefit from the universal 
HIV test-and-treat intervention.

#### See `?estimate_transmission_flows_and_ci()` for a description of all the output variables



## A step further: Exploring available options


Further to estimating transmission flows, the bumblebee package provides estimates for:


* p_hat, the probability of linkage between pathogen sequences from two individuals randomly 
  sampled from their respective population groups

* p_group_pairing_linked, the joint probability that a pair of pathogen sequences is 
  from a specific population group pairing and linked

* c_hat, the probability of clustering, more precisely, the probability that a pathogen 
  sequence from one population group links with at least one pathogen sequence from 
  another population group


and confidence intervals for the following methods: 

* Goodman with a continuity correction (useful for small samples) 

* Sison-Glaz

* Queensbury-Hurst


```

# Estimate transmission flows and confidence intervals: Detailed output  -----------------


# Detailed output

results_estimate_transmission_flows_and_ci_detailed <- estimate_transmission_flows_and_ci(
    group_in = sampling_frequency$population_group, 
    individuals_sampled_in = sampling_frequency$number_sampled, 
    individuals_population_in = sampling_frequency$number_population, 
    linkage_counts_in = counts_hiv_transmission_pairs,
    detailed_report = TRUE,
    verbose_output = TRUE)
 
# View results
results_estimate_transmission_flows_and_ci_detailed

# Retrieve dataset of estimated transmission flows 
dframe <- results_estimate_transmission_flows_and_ci_detailed$flows_dataset

```

