---
title: "Accessing Non-Integrated Datasets"
author: "Helen Miller"
date: "2022-06-15"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Accessing Non-Integrated Datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---



Many studies include data from assays which have not been integrated into the DataSpace. Some of these are available as "Non-Integrated Datasets," which can be downloaded from the app as a zip file. `DataSpaceR` provides an interface for accessing non-integrated data from studies where it is available.

## Viewing available non-integrated data

Methods on the DataSpace Study object allow you to see what non-integrated data may be available before downloading it. We will be using HVTN 505 as an example.


```r
library(DataSpaceR)
con <- connectDS()
vtn505 <- con$getStudy("vtn505")
vtn505
#> <DataSpaceStudy>
#>   Study: vtn505
#>   URL: https://dataspace.cavd.org/CAVD/vtn505
#>   Available datasets:
#>     - Binding Ab multiplex assay
#>     - Demographics
#>     - Intracellular Cytokine Staining
#>     - Neutralizing antibody
#>   Available non-integrated datasets:
#>     - ADCP
#>     - Demographics (Supplemental)
#>     - Fc Array
```

The print method on the study object will list available non-integrated datasets. The `availableDatasets` property shows some more info about available datasets, with the `integrated` field indicating whether the data is integrated. The value for `n` will be `NA` for non-integrated data until the dataset has been loaded.


```r
knitr::kable(vtn505$availableDatasets)
```



|name         |label                           |     n|integrated |
|:------------|:-------------------------------|-----:|:----------|
|BAMA         |Binding Ab multiplex assay      | 10260|TRUE       |
|Demographics |Demographics                    |  2504|TRUE       |
|ICS          |Intracellular Cytokine Staining | 22684|TRUE       |
|NAb          |Neutralizing antibody           |   628|TRUE       |
|ADCP         |ADCP                            |    NA|FALSE      |
|DEM SUPP     |Demographics (Supplemental)     |    NA|FALSE      |
|Fc Array     |Fc Array                        |    NA|FALSE      |

## Loading non-integrated data

Non-Integrated datasets can be loaded with `getDataset` like integrated data. This will unzip the non-integrated data to a temp directory and load it into the environment.


```r
adcp <- vtn505$getDataset("ADCP")
dim(adcp)
#> [1] 378  11
colnames(adcp)
#>  [1] "study_prot"             "participant_id"         "study_day"             
#>  [4] "lab_code"               "specimen_type"          "antigen"               
#>  [7] "percent_cv"             "avg_phagocytosis_score" "positivity_threshold"  
#> [10] "response"               "assay_identifier"
```

You can also view the file format info using `getDatasetDescription`. For non-integrated data, this will open a pdf into your computer's default pdf viewer.


```r
vtn505$getDatasetDescription("ADCP")
```

Non-integrated data is downloaded to a temp directory by default. There are a couple of ways to override this if desired. One is to specify `outputDir` when calling `getDataset` or `getDatasetDescription`.

If you will be accessing the data at another time and don't want to have to re-download it, you can change the default directory for the whole study object with `setDataDir`.


```r
vtn505$dataDir
#> [1] "/tmp/RtmpoDO8Tc"
vtn505$setDataDir(".")
vtn505$dataDir
#> [1] "/home/jmtaylor/Projects/DataSpaceR/vignettes"
```

If the dataset already exists in the specified `dataDir` or `outputDir`, it will be not be downloaded. This can be overridden with `reload=TRUE`, which forces a re-download.


## Session information


```r
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
#>  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
#>  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
#>  [7] LC_PAPER=en_US.utf8       LC_NAME=C                
#>  [9] LC_ADDRESS=C              LC_TELEPHONE=C           
#> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.2 DataSpaceR_0.7.5  knitr_1.37       
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8       digest_0.6.29    assertthat_0.2.1 R6_2.5.1        
#>  [5] jsonlite_1.8.0   magrittr_2.0.2   evaluate_0.15    highr_0.9       
#>  [9] httr_1.4.2       stringi_1.7.6    curl_4.3.2       tools_4.1.2     
#> [13] stringr_1.4.0    Rlabkey_2.8.3    xfun_0.29        compiler_4.1.2
```
