---
title: "The Beta-Binomial Test for Count Data"
author: "Thang V Pham"
date: "2021-01-24"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{countdata: The Beta-Binomial Test for Count Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>"
)
```

A number of technology platforms in proteomics and genomics produce count data for quantitative analysis. In proteomics, the number of MS/MS events observed for a protein in a mass spectrometry experiment has been shown to correlate strongly with the protein's abundance. In genomics, next-generation sequencing technologies use read counts as a measure of the abundance of the target transcripts. The R package _countdata_ contains functions for statistical significance analysis of count data for both paired and unpaired designs.

## Quick start

The user needs to install the package once, most likely by entering `install.packages("countdata")` in the R console, and load the package before use.

```{r setup}
library(countdata)
```

The following are three complete examples, from reading input data in a tab-deliminated text format to writing the result to a local text file `output.txt`. The output column `pval` contains the p-values of the test and the column `pval.BH` contains the p-values adjusted for multiple testing using the Benjamini-Hochberg method (Benjamini \& Hochberg, 1995).

**Two-group beta-binomial test** (Pham et al., Bioinformatics 2010)
```{r}
d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")

head(d)
#>    a1  a2  a3  b1  b2  b3  c1  c2
#> 1 624 496 509 414 394 375 325 288
#> 2 615 854 930 341 523 360 359 329
#> 3 553 560 745 819 490 481 480 500
#> 4 525 412 401 354 321 310 258 228
#> 5 484 284 315 268 282 307 270 298
#> 6 482 348 400 242 365 367  81 118

# compare the first 3 samples against the next three samples
out <- countdata::bb.test(d[, 1:6], 
                          colSums(d[, 1:6]), 
                          c(rep("a", 3), rep("b", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 2, no. of samples = 6...
#> 5%
#> 11%
#> 18%
#> 23%
#> 30%
#> 35%
#> 41%
#> 49%
#> 55%
#> 64%
#> 81%
#> 87%
#> 93%
#> 98%
#> Done.

d.norm <-  countdata::normalize(d[, 1:8])

write.table(cbind(d, d.norm, 
                  fc = countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6]),
                  pval = out$p.value,
                  pval.BH = p.adjust(out$p.value, method = "BH")),
            file = "output.txt", row.names = FALSE, sep = "\t")
```

**Three-group beta-binomial test** (Pham et al., Bioinformatics 2010)
```{r}
d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")

head(d)
#>    a1  a2  a3  b1  b2  b3  c1  c2
#> 1 624 496 509 414 394 375 325 288
#> 2 615 854 930 341 523 360 359 329
#> 3 553 560 745 819 490 481 480 500
#> 4 525 412 401 354 321 310 258 228
#> 5 484 284 315 268 282 307 270 298
#> 6 482 348 400 242 365 367  81 118

# compare the first 3 samples, the next three samples, and the last two samples.
out <- countdata::bb.test(d[, 1:8],
                          colSums(d[, 1:8]),
                          c(rep("a", 3), rep("b", 3), rep("c", 2)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 3, no. of samples = 8...
#> 0%
#> 5%
#> 10%
#> 16%
#> 23%
#> 33%
#> 40%
#> 45%
#> 51%
#> 58%
#> 63%
#> 70%
#> 76%
#> 81%
#> 87%
#> 93%
#> 98%
#> Done.

d.norm <-  countdata::normalize(d[, 1:8])

write.table(cbind(d, d.norm, 
                  pval = out$p.value,
                  pval.BH = p.adjust(out$p.value, method = "BH")),
            file = "output.txt", row.names = FALSE, sep = "\t")
```

**Two-group paired beta-binomial test** (Pham \& Jimenez, Bioinformatics 2012)
```{r}
d <- read.delim("https://tvpham.github.io/data/example-paired.txt")

head(d)
#>   pre.1 pre.2 pre.3 post.1 post.2 post.3
#> 1   575   179   335    505    172    204
#> 2   294   245   256    396    390    265
#> 3   293   282   320    372    240    204
#> 4   303   282   250    307    243    227
#> 5   396   271   171    327    216    103
#> 6   238   261   271    245    234    215

out <- countdata::ibb.test(d[, 1:6], 
                           colSums(d[, 1:6]), 
                           c(rep("pre_treatment", 3), rep("post_treatment", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 2919, no. of pair(s) = 3...
#> 0%
#> 5%
#> 11%
#> 16%
#> 21%
#> 26%
#> 31%
#> 36%
#> 41%
#> 46%
#> 51%
#> 57%
#> 63%
#> 68%
#> 73%
#> 78%
#> 83%
#> 88%
#> 94%
#> 99%
#> Done.

d.norm <-  countdata::normalize(d[, 1:6])

write.table(cbind(d, d.norm, 
                  fc = out$fc,
                  pval = out$p.value,
                  pval.BH = p.adjust(out$p.value, method = "BH")),
            file = "output.txt", row.names = FALSE, sep = "\t")

```

## Unpaired test

The beta-binomial test (Pham et al., Bioinformatics 2010) is used for significance analysis of independent samples for two or more groups. Suppose that a vector `x` contains the count numbers and a vector `tx` the corresponding normalization sizes, for example the total spectral counts per sample in a proteomics experiment. In addition, the group information is specified in a vector `group`. We perform a beta-binomial test as follows.
```{r}
x <- c(1, 5, 1, 10, 9, 11, 2, 8)

tx <- c(19609, 19053, 19235, 19374, 18868, 19018, 18844, 19271)

group <- c(rep("cancer", 3), rep("normal", 5))

countdata::bb.test(x, tx, group)
#> Using a single CPU core ...
#> No. of data rows = 1, no. of groups = 2, no. of samples = 8...
#> 100%
#> Done.
#> $p.value
#> [1] 0.01568598
```

The test can be applied to a data matrix row by row. The following example compares three groups, the first group consisting of columns 1 to 3 of a data matrix, the second columns 4 to 6, and the third columns 7 and 8.
```{r}
d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")

# compare 3 groups, using all available CPU cores
out <- countdata::bb.test(d[, 1:8], 
                          colSums(d[, 1:8]), 
                          c(rep("a", 3), rep("b", 3), rep("c", 2)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 3, no. of samples = 8...
#> 0%
#> 5%
#> 11%
#> 16%
#> 21%
#> 28%
#> 34%
#> 40%
#> 46%
#> 52%
#> 58%
#> 77%
#> 83%
#> 88%
#> 93%
#> 98%
#> Done.
```

The output `out$p.value` contains p-values for each row of `d`. For multiple testing correction, use the `p.adjust` function in R. For example, to apply the Benjamini & Hochberg method for p-value adjustment
```{r}
pval.BH <- p.adjust(out$p.value, method = "BH")
```

### Data normalization

The beta-binomial test takes **raw counts as input** and normalizes the values internally. To obtain normalized values, we multiply each sample by a scaling factor so that the total counts are equal across samples. This procedure is implemented in the `normalize` function.
```{r}
d.norm <-  countdata::normalize(d[, 1:8])

# check -- all values should be equal
colSums(d.norm)
#>    a1    a2    a3    b1    b2    b3    c1    c2 
#> 19159 19159 19159 19159 19159 19159 19159 19159
```

The matrix `d.norm` contains the normalized data. We can combine the normalized data and the result of the beta-binomial test by `cbind(d.norm, out$p.value)`, and subsequently save or view the resulting matrix
```{r}
head(cbind(d.norm, out$p.value))
#>         a1       a2       a3       b1       b2       b3        c1       c2
#> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262
#> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879
#> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941
#> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749
#> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681
#> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209  82.35401 117.3142
#>    out$p.value
#> 1 0.0001164498
#> 2 0.0016150240
#> 3 0.4501729545
#> 4 0.0003105657
#> 5 0.2521494212
#> 6 0.0001057092
```

### Fold change calculation

The package provides a convenient function to calculate the ratio between the averages of two sample groups row by row. For example, to calculate the fold change between all group pairs
```{r}
fc12 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6])
fc13 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 7:8])
fc23 <- countdata::fold.change(d.norm[, 4:6], d.norm[, 7:8], BIG = 100)

head(cbind(d.norm, out$p.value, fc12, fc13, fc23))
#>         a1       a2       a3       b1       b2       b3        c1       c2
#> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262
#> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879
#> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941
#> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749
#> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681
#> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209  82.35401 117.3142
#>    out$p.value      fc12      fc13      fc23
#> 1 0.0001164498 -1.360633 -1.746148 -1.283335
#> 2 0.0016150240 -1.938309 -2.298320 -1.185735
#> 3 0.4501729545 -1.029825 -1.248907 -1.212738
#> 4 0.0003105657 -1.342337 -1.808716 -1.347438
#> 5 0.2521494212 -1.245834 -1.252351 -1.005232
#> 6 0.0001057092 -1.244604 -4.071068 -3.270976
```

Note that a positive fold change value means up-regulation where the average of the second group is higher than that of the first group. Conversely, a negative value means down-regulation where the average of the first group is higher than that of the second. If one group contains all zeros, a positive or negative big value is returned (default BIG = 10000).

## Paired test

The inverted beta-binomial test (Pham \& Jimenez, Bioinformatics 2012) is used for paired sample testing, for example between pre-treatment and post-treatment data. The following is an example of a paired test.

```{r}
x <- c(33, 32, 86, 51, 52, 149)

tx <- c(7742608, 15581382, 20933491, 7126839, 13842297, 14760103)

group <- c(rep("cancer", 3), rep("normal", 3))

countdata::ibb.test(x, tx, group)
#> Using a single CPU core ...
#> No. of data rows = 1, no. of pair(s) = 3...
#> 100%
#> Done.
#> $p.value
#> [1] 0.004103636
#> 
#> $fc
#> [1] 2.137632
```

Analogously to the unpaired situation, the test can be performed for a data matrix row by row. The following example compares two groups where columns 1 to 3 are respectively paired with columns 4 to 6.
```{r}
d <- read.delim("https://tvpham.github.io/data/example-paired.txt")

# perform a paired test for all rows
out <- countdata::ibb.test(d[, 1:6], 
                           colSums(d[, 1:6]), 
                           c(rep("pre_treatment", 3), rep("post_treatment", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 2919, no. of pair(s) = 3...
#> 0%
#> 5%
#> 10%
#> 16%
#> 21%
#> 26%
#> 31%
#> 37%
#> 42%
#> 47%
#> 52%
#> 57%
#> 62%
#> 67%
#> 73%
#> 78%
#> 83%
#> 88%
#> 93%
#> 98%
#> Done.
```

The result `out` is a list of two elements where `p.value` is the p-value of the test and `fc` an estimation of the common fold change. We can calculate the normalized data and write out the result as follows.
```{r}
d.norm <- countdata::normalize(d[, 1:6])

head(cbind(d.norm, out$p.value, out$fc))
#>      pre.1    pre.2    pre.3   post.1   post.2   post.3 out$p.value    out$fc
#> 1 594.2861 176.8823 347.1786 490.0689 164.2960 208.5462 0.064856368 -1.297663
#> 2 303.8611 242.1015 265.3067 384.2916 372.5316 270.9056 0.072525346  1.259570
#> 3 302.8275 278.6637 331.6333 361.0012 229.2502 208.5462 0.334902070 -1.168634
#> 4 313.1629 278.6637 259.0885 297.9231 232.1159 232.0588 0.045979358 -1.117328
#> 5 409.2823 267.7939 177.2166 317.3317 206.3252 105.2954 0.006665643 -1.356184
#> 6 245.9828 257.9122 280.8520 237.7562 223.5190 219.7913 0.048216659 -1.151104
```


## References

1. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. _Journal of the Royal Statistical Society Series B_ 57, 289–300.

1. Pham TV, Piersma SR, Warmoes M, Jimenez CR (2010) On the beta binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics. _Bioinformatics_, 26(3):363-369.

1. Pham TV, Jimenez CR (2012) An accurate paired sample test for count data. _Bioinformatics_, 28(18):i596-i602.
