---
title: "Creating a cdm reference using Spark"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{a03_creating_a_cdm_reference}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  eval = FALSE
)
```

So far we've been using a local Spark connection for introducing the OmopOnSpark package. However, in practice, when working with patient-level health data our data will most likely be in the cloud-based Databricks plaform which is built around Apache Spark. Once we have created our cdm reference, the same code we have seen when working with a local Spark dataset will also work with Databricks. It is *just* that the way we connect will differ.

To create your connection to https://spark.posit.co/deployment/databricks-connect.html. Briefly, first you would save environmental variables. 

```{r}
usethis::edit_r_environ()

DATABRICKS_HOST = "Enter here your Workspace URL"
DATABRICKS_TOKEN = "Enter here your personal token" 
```

With these saved you should now be able to connect with sparklyr, specifying your cluster ID. 
```{r}
library(sparklyr)
con <- spark_connect(
  cluster_id = "Enter here your cluster ID",
  method = "databricks_connect"
)
con
```

With this, we can check that everything is working and we have an open connection 
```{r}
connection_is_open(con)
```

With this, we we should be able to create a reference to a table. Let's say we our OMOP CDM data is in a catalog called "my_catalog" and a schema called "my_omop_schema". We should be able to create a reference to our person table.

```{r}
library(dplyr)
tbl(con, I("my_catalog.my_omop_schema.person"))
```

We should be able to collect the first five rows of this table into R
```{r}
tbl(con, I("my_catalog.my_omop_schema.person")) |> 
  head(5) |> 
  collect()
```

As well as this we should be able to go in the other direction and copy data from R to a Spark dataframe. 
```{r}
spark_cars_df <- sdf_copy_to(con,
                             cars,
                             overwrite = TRUE)
spark_cars_df
```

If these basics are working we should be well set-up to start working with OmopOnSpark. Here we would spefify our cdm schema as we've seen above. And now let's say we have another schema called "my_results_schema" where we want to save any study-specific tables. We'll use this when specifying the write schema. In addition, we can also give a write prefix and all the tables we create during the course of the working with this cdm reference will start with this prefix. 

```{r, eval=FALSE}
library(OmopOnSpark)
cdm <- cdmFromSpark(con,
  cdmSchema = "my_catalog.my_omop_schema",
  writeSchema = "my_catalog.my_results_schema",
  writePrefix = "study_1_"
)
cdm
```

