---
title: "Dynamic Data Definition"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dynamic Data Definition}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r chunkname, echo=-1}
data.table::setDTthreads(2)
```

```{r, echo = FALSE, message = FALSE}
options(digits = 3)

library(simstudy)
library(ggplot2)
library(scales)
library(grid)
library(gridExtra)
library(survival)
library(gee)
library(data.table)
odds <- function (p)  p/(1 - p) # TODO temporary remove when added to package
plotcolors <- c("#B84226", "#1B8445", "#1C5974")

cbbPalette <- c("#B84226","#B88F26", "#A5B435", "#1B8446",
                "#B87326","#B8A526", "#6CA723", "#1C5974") 

ggtheme <- function(panelback = "white") {
  
  ggplot2::theme(
    panel.background = element_rect(fill = panelback),
    panel.grid = element_blank(),
    axis.ticks =  element_line(colour = "black"),
    panel.spacing =unit(0.25, "lines"),  # requires package grid
    panel.border = element_rect(fill = NA, colour="gray90"), 
    plot.title = element_text(size = 8,vjust=.5,hjust=0),
    axis.text = element_text(size=8),
    axis.title = element_text(size = 8)
  )  
  
}

```

Often, we'd like to explore data generation and modeling under different scenarios. For example, we might want to understand the operating characteristics of a model under different variance or other parametric assumptions. `simstudy` has built-in functionality to facilitate this type of dynamic exploration. First, the functions `updateDef` and `updateDefAdd` allow us to edit rows of existing data definition tables. Second, a built-in mechanism called *double-dot* reference gives access to external variables that do not exist in a defined data set or data definition.

## Updating existing definition tables

The `updateDef` function updates a row in a definition table created by the functions `defData` or `defRead`. Analogously, the `updateDefAdd` function updates a row in a definition table created by the functions `defDataAdd` or `defReadAdd`.

The original data set definition includes three variables `x`, `y`, and `z`, all normally distributed:

```{r}
defs <- defData(varname = "x", formula = 0, variance = 3, dist = "normal")
defs <- defData(defs, varname = "y", formula = "2 + 3*x", variance = 1, dist = "normal")
defs <- defData(defs, varname = "z", formula = "4 + 3*x - 2*y", variance = 1, dist = "normal")

defs
```

In the first case, we are changing the relationship of `y` with `x` as well as the variance:

```{r}
defs <- updateDef(dtDefs = defs, changevar = "y", newformula = "x + 5", newvariance = 2)
defs
```

In this second case, we are changing the distribution of `z` to *Poisson* and updating the link function to *log*:

```{r}
defs <- updateDef(dtDefs = defs, changevar = "z", newdist = "poisson", newlink = "log")
defs
```

In the last case, we remove a variable from the data set definition. Note that in a definition created by `defData`, it is not possible to remove a variable that is a predictor of a subsequent variable, such as `x` or `y` in this case.

```{r}
defs <- updateDef(dtDefs = defs, changevar = "z", remove = TRUE)
defs
```
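
The same pattern applies to definitions that are meant to add columns to an existing data set. As a minimal sketch (the variable `w` and the table name `defsAdd` are illustrative and not part of the example above), `updateDefAdd` edits a row of a table created by `defDataAdd`:

```{r}
library(simstudy)

# Illustrative add-on definition; "w" and "defsAdd" are hypothetical names
defsAdd <- defDataAdd(varname = "w", formula = "1 + 2*x", 
  variance = 1, dist = "normal")

# Change the slope and the variance of "w" in the existing row
defsAdd <- updateDefAdd(dtDefs = defsAdd, changevar = "w", 
  newformula = "1 + 3*x", newvariance = 2)

defsAdd
```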

## Double-dot external variable reference

For a truly dynamic data definition process, `simstudy` (as of `version 0.2.0`) allows users to reference variables that exist outside of data generation. These can be thought of as a type of hyperparameter of the data generation process. The reference is made directly in the formula itself, using a double-dot ("..") notation before the variable name. Here is a simple example:

```{r}
def <- defData(varname = "x", formula = 0, 
  variance = 5, dist = "normal")
def <- defData(def, varname = "y", formula = "..B0 + ..B1 * x", 
  variance = "..sigma2", dist = "normal")

def
```

```{r}
B0 <- 4
B1 <- 2
sigma2 <- 9

set.seed(716251)

dd <- genData(100, def)

fit <- summary(lm(y ~ x, data = dd))

coef(fit)
fit$sigma
```

It is easy to create a new data set on the fly with a different variance assumption without having to go to the trouble of updating the data definitions.

```{r}
sigma2 <- 16

dd <- genData(100, def)
fit <- summary(lm(y ~ x, data = dd))

coef(fit)
fit$sigma
```

The double-dot notation can be flexibly applied using `lapply` (or the parallel version `mclapply`) to create a range of data sets under different assumptions:

```{r, fig.width = 5}
sigma2s <- c(1, 2, 6, 9)

gen_data <- function(sigma2, d) {
  dd <- genData(200, d)
  dd$sigma2 <- sigma2
  dd
}

dd_4 <- lapply(sigma2s, function(s) gen_data(s, def))
dd_4 <- rbindlist(dd_4)

ggplot(data = dd_4, aes(x = x, y = y)) +
  geom_point(size = .5, color = "grey30") +
  facet_wrap(sigma2 ~ .) +
  theme(panel.grid = element_blank())
```
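
Because the residual standard error from `lm` estimates $\sqrt{\sigma^2}$, the same pattern offers a quick check that each generated data set reflects the variance it was given. The sketch below is illustrative only: the helper `est_sd`, the definition `def_chk`, the fixed coefficients, and the sample size of 2000 are assumptions made for this example. It relies on `genData` resolving `..sigma2` from the calling function, as `gen_data` does above.

```{r}
library(simstudy)

# Illustrative definition: only the variance is an external reference
def_chk <- defData(varname = "x", formula = 0, variance = 5, dist = "normal")
def_chk <- defData(def_chk, varname = "y", formula = "4 + 2 * x", 
  variance = "..sigma2", dist = "normal")

# Hypothetical helper: generate data for one value of sigma2 and
# return the residual standard error of the fitted model
est_sd <- function(sigma2, d, n = 2000) {
  dd <- genData(n, d)
  summary(lm(y ~ x, data = dd))$sigma
}

set.seed(2718)
sigma2_vals <- c(1, 4, 9, 16)

res <- data.frame(
  sigma2 = sigma2_vals, 
  target_sd = sqrt(sigma2_vals),
  estimated_sd = sapply(sigma2_vals, est_sd, d = def_chk)
)

res
```

The estimated values should track `target_sd` closely, with the gap shrinking as `n` grows.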

## Using non-scalar double-dot variable reference

The double-dot notation is also *array-friendly*. For example, if we want to create a mixture distribution from a vector of values (which we can also do using a *categorical* distribution), we can define the mixture formula in terms of the vector. In this case, we are generating permuted block sizes of 2 and 4:

```{r}
defblk <- defData(varname = "blksize", 
   formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")

defblk
```

```{r}
sizes <- c(2, 4)
genData(1000, defblk)
```
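
Since the external vector is looked up when the data are generated, changing it changes the block sizes without touching the definition. A brief sketch, where `defblk_demo` and the second pair of sizes are illustrative choices:

```{r}
library(simstudy)

sizes <- c(2, 4)
defblk_demo <- defData(varname = "blksize", 
  formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")

set.seed(331)
tab_24 <- table(genData(1000, defblk_demo)$blksize)

# Same definition, new external vector
sizes <- c(4, 6)
tab_46 <- table(genData(1000, defblk_demo)$blksize)

tab_24
tab_46
```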

In this second example, there is a vector variable *tau* of positive real numbers that sum to 1, and we want to calculate the weighted average of three numbers using *tau* as the weights. We could use the following code to calculate the weighted average *theta*:

```{r}
tau <- rgamma(3, 5, 2)
tau <- tau / sum(tau)
tau

d <- defData(varname = "a", formula = 3, variance = 4)
d <- defData(d, varname = "b", formula = 8, variance = 2)
d <- defData(d, varname = "c", formula = 11, variance = 6)
d <- defData(d, varname = "theta", formula = "..tau[1]*a + ..tau[2]*b + ..tau[3]*c", 
  dist = "nonrandom")

set.seed(1)
genData(4, d)
```

We can simplify the calculation of *theta* by using matrix multiplication:

```{r}
d <- updateDef(d, changevar = "theta", newformula = "t(..tau) %*% c(a, b, c)")

set.seed(1)
genData(4, d)
```
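
One way to confirm that the two formulations agree is to recompute the weighted average by hand and compare it with the generated column. The sketch below is self-contained and uses its own names (`wts`, `d_wt`, `dd_wt`, and `manual` are all illustrative):

```{r}
library(simstudy)

set.seed(82)
wts <- rgamma(3, 5, 2)
wts <- wts / sum(wts)

d_wt <- defData(varname = "a", formula = 3, variance = 4)
d_wt <- defData(d_wt, varname = "b", formula = 8, variance = 2)
d_wt <- defData(d_wt, varname = "c", formula = 11, variance = 6)
d_wt <- defData(d_wt, varname = "theta", 
  formula = "t(..wts) %*% c(a, b, c)", dist = "nonrandom")

dd_wt <- genData(4, d_wt)

# Recompute the weighted average term by term and compare
manual <- with(dd_wt, wts[1] * a + wts[2] * b + wts[3] * c)
all.equal(as.numeric(dd_wt$theta), as.numeric(manual))
```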

These arrays can also have **multiple dimensions**, as in a $2 \times 2$ matrix. If we want to specify the mean outcomes for a factorial study design with two interventions $a$ and $b$, we can use a simple matrix and draw the means directly from the matrix, which in this example is stored in the variable *effect*:

```{r}
effect <- matrix(c(0, 4, 5, 7), nrow = 2)
effect
```

Using double-dot notation, it is possible to reference the matrix cell values directly:

```{r}
d1 <- defData(varname = "a", formula = ".5;.5", variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "b", formula = ".5;.5", variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "outcome", formula = "..effect[a, b]", dist="nonrandom")
```

```{r}
dx <- genData(1000, d1)
dx
```
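
To see that each record picks up the cell that matches its own levels of `a` and `b`, the generated outcome can be compared with a direct lookup in the matrix. This is a sketch under illustrative names (`cellmeans`, `d_fac`, `dx_fac`, and `idx` are not part of the example above); it assumes, as the formula above does, that the generated category values can be used as row and column indices:

```{r}
library(simstudy)

cellmeans <- matrix(c(0, 4, 5, 7), nrow = 2)

d_fac <- defData(varname = "a", formula = ".5;.5", 
  variance = "1;2", dist = "categorical")
d_fac <- defData(d_fac, varname = "b", formula = ".5;.5", 
  variance = "1;2", dist = "categorical")
d_fac <- defData(d_fac, varname = "outcome", 
  formula = "..cellmeans[a, b]", dist = "nonrandom")

set.seed(905)
dx_fac <- genData(200, d_fac)

# A two-column index matrix selects one (row, column) cell per record
idx <- cbind(as.integer(dx_fac$a), as.integer(dx_fac$b))
all(dx_fac$outcome == cellmeans[idx])
```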

It is possible to generate normally distributed data based on these means:

```{r}
d1 <- updateDef(d1, "outcome", newvariance = 9, newdist = "normal")
dx <- genData(1000, d1)
```

The plot shows the individual values as well as the mean values by intervention arm:

```{r, echo=FALSE}
dsum <- dx[, .(outcome=mean(outcome)), keyby = .(a, b)]

ggplot(data = dx, aes(x = factor(a), y = outcome)) +
  geom_jitter(aes(color = factor(b)), width = .2, alpha = .4, size = .2) +
  geom_point(data = dsum, size = 2, aes(color = factor(b))) + 
  geom_line(data = dsum, linewidth = 1, aes(color = factor(b), group = factor(b))) +
  scale_color_manual(values = cbbPalette, name = "  b") +
  theme(panel.grid = element_blank()) +
  xlab ("a")
```