---
title: "Using corrr with databases"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using corrr with databases}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(dplyr)
library(dbplyr)
library(corrr)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

To calculate correlations with data inside databases, it is very common to import the data into R and then run the analysis.  This is not a desirable path, because of the overhead created by copying the data into memory.  

Taking advantage of the latest features offered by `dbplyr` and `rlang`, it is now possible to run the correlation calculation **inside the database**, thus avoiding importing the data.

## An example

A simple SQLite database will be used to this example.  A temporary database is created, and the `mtcars` data set is copied to it.  The `db_mtcars` variable is only a pointer to the new table inside the database, it does not hold any data. 

```{r}
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")
db_mtcars <- copy_to(con, mtcars)
```

Even though it is not a formal `data.frame` object, `db_mtcars` can be use as if it was a `data.frame` and simply pipe it into the `correlate()` function.

The `correlate()` function will check the type of object passed, if it is a database-backed table, meaning a `tbl_sql()` object class, then it will use the new `tidyeval` code to calculate the correlations inside the database.  The function will automatically retrieve only the results of the operation.

```{r}
library(dplyr)
library(corrr)
db_mtcars %>%
  correlate(quiet = TRUE)
```

The `tidyeval`-based function ensures that a `cor_df` object is returned, so then it can be used with other functions in the `corrr` package.

```{r}
db_mtcars %>%
  correlate(quiet = TRUE) %>%
  rplot()
```

## `sparklyr`

For connections using `sparklyr`, `corrr` will use that package function called `ml_corr()` to run all of the correlations at the same time.  That is all done under the hood.  The user just needs to pass a `tbl_spark` object to the `correlate()` function, and `corrr` will automatically select the right function to run.


## Limitations

When using `correlate()` with a database-backed table, please make sure to keep the following items in mind:

- Only the **pearson** method is supported.  It is the default method, so it is not necessary to pass it.  But if a different method is chosen, then the operation will return an error.

- Only pairwise complete observations are used. Meaning that the `use` argument has to be set to `pairwise.complete.obs`. 

- The `y` argument is not supported.  This means that if 1-to-1 comparisons are needed, then pre-select the two variables prior passing it to the `correlate()` function.

- The `diagonal` argument only accepts NA's. 
