---
title: "Simulating a simulation study"
author: "Alessandro Gasparini"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulating a simulation study}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---
```{r setup, include = FALSE}
options(width = 150)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center", fig.height = 6, fig.width = 6,
  out.width = "75%"
)
```

# Introduction

In this vignette, we show how the simulated data included as an example dataset in `simsum` has been generated.

# Motivation

Say we want to run a simulation study in which we want to compare the sensitivity of parametric and semiparametric survival models on relative risk estimates.

# Data generating mechanisms

We simulate an hypothetical trial with a binary treatment. We fix the log-treatment effect to $-0.50$, and we generate a treatment indicator variable for each simulated individual via a $Binom(1, 0.5)$ random variable. We simulate two different sample sizes (50 and 250 individuals) and we assume two different baseline hazard functions: exponential with scale parameter $\lambda = 0.5$, and Weibull with scale parameter $\lambda = 0.5$ and shape parameter $\gamma = 1.5$. Finally, we apply administrative censoring at time $t = 5$.

```{r baseline-hazards}
exp_basehaz <- function(t, lambda = 0.5) lambda * 1 * t^0
exp_weibull <- function(t, lambda = 0.5, gamma = 1.5) lambda * gamma * t^(gamma - 1)
curve(exp_basehaz, from = 0, to = 5, lty = 1, ylim = c(0, 2), ylab = expression(h[0](t)), xlab = "Follow-up time t")
curve(exp_weibull, from = 0, to = 5, lty = 2, add = TRUE)
legend(x = "topleft", lty = 1:2, legend = c("Exponential baseline hazard", "Weibull baseline hazard"), bty = "n")
```

The survival times are estimated using the approach of Bender _et al_. (2005), based on drawing from a $U(0, 1)$ random variable and applying the following transformations:

1. for an exponential baseline hazard, the survival time $t$ is simulated as:
$$t = -\frac{log(U)}{\lambda \exp(\beta ^ T X)}$$

2. for a Weibull baseline hazard, the survival time $t$ is simulated as:
$$t = \left(-\frac{log(U)}{\lambda \exp(\beta ^ T X)}\right) ^ {1 / \gamma}$$

The R function to simulate a dataset for our simulation study is defined as follows:

```{r dgfun}
simulate_data <- function(dataset, n, baseline, params = list(), coveff = -0.50) {
  # Simulate treatment indicator variable
  x <- rbinom(n = n, size = 1, prob = 0.5)
  # Draw from a U(0,1) random variable
  u <- runif(n)
  # Simulate survival times depending on the baseline hazard
  if (baseline == "Exponential") {
    t <- -log(u) / (params$lambda * exp(x * coveff))
  } else {
    t <- (-log(u) / (params$lambda * exp(x * coveff)))^(1 / params$gamma)
  }
  # Winsorising tiny values for t (smaller than one day on a yearly-scale, e.g. 1 / 365.242), and adding a tiny amount of white noise not to have too many concurrent values
  t <- ifelse(t < 1 / 365.242, 1 / 365.242, t)
  t[t == 1 / 365.242] <- t[t == 1 / 365.242] + rnorm(length(t[t == 1 / 365.242]), mean = 0, sd = 1e-4)
  # ...and make sure that the resulting value is positive
  t <- abs(t)

  # Make event indicator variable applying administrative censoring at t = 5
  d <- as.numeric(t < 5)
  t <- pmin(t, 5)
  # Return a data.frame
  data.frame(dataset = dataset, x = x, t = t, d = d, n = n, baseline = baseline, stringsAsFactors = FALSE)
}
```

# Methods

We compare the Cox model (Cox, 1972) with a fully parametric survival model assuming an exponential baseline hazard and a flexible parametric model with 2 degrees of freedom for modelling the baseline hazard (Royston and Parmar, 2002). The Cox model can be fit via the `coxph` function from the `survival` package, the exponential model can be fit via the `phreg` function from the `eha` package, and the Royston-Parmar model can be fixed via the `stpm2` function from the `rstpm2` package.

# Performance measures

Say we are interested in the following performance measures:

* Bias in the estimated log-treatment effect, and corresponding $95\%$ Monte Carlo confidence
intervals
* Coverage of confidence intervals for the log-treatment effect, defined as the proportion of simulated data sets for which the true log-treatment effect of $-0.50$ lies within the $95\%$ confidence intervals obtained from the model

# Sample size

We are primarily interested in bias, and assume that the variance of the estimated log-treatment effect is $0.1$. The Monte Carlo standard error for the bias is:

$$\text{MCSE} = \sqrt{\frac{\text{variance}}{\# \text{simulations}}}$$

Aiming for a Monte Carlo standard error of 0.01 on the estimated bias, we would require $1,000$ replications. 

The Monte Carlo standard error for coverage is:

$$\text{MCSE} = \sqrt{\frac{\text{coverage} \times (1 - \text{coverage})}{\# \text{simulations}}}$$

This Monte Carlo standard error is maximised for a coverage = $0.5$. In that setting, the Monte Carlo standard error with $1,000$ replications would be $0.01581139$, which is deemed to be acceptable.

Therefore, we should run $1,000$ replications of this simulation study.
However, for simplicity, we will run $100$ replications only to speed up the process.

# Running the simulation study

## Generate data

We generate $100$ datasets for each data-generating mechanism.

First, we set a random seed for reproducibility:

```{r set-seed}
set.seed(755353002)
```

Then, we simulate the data:

```{r generate-data}
reps <- 1:100
data <- list()
data[["n = 50, baseline = Exp"]] <- lapply(
  X = reps,
  FUN = simulate_data,
  n = 50,
  baseline = "Exponential",
  params = list(lambda = 0.5)
)
data[["n = 250, baseline = Exp"]] <- lapply(
  X = reps,
  FUN = simulate_data,
  n = 250,
  baseline = "Exponential",
  params = list(lambda = 0.5)
)
data[["n = 50, baseline = Wei"]] <- lapply(
  X = reps,
  FUN = simulate_data,
  n = 50,
  baseline = "Weibull",
  params = list(lambda = 0.5, gamma = 1.5)
)
data[["n = 250, baseline = Wei"]] <- lapply(
  X = reps,
  FUN = simulate_data,
  n = 250,
  baseline = "Weibull",
  params = list(lambda = 0.5, gamma = 1.5)
)
```

## Run models

We define a function to fit the models of interest:

```{r fitting-function}
library(survival)
library(rstpm2)
library(eha)

fit_models <- function(data, model) {
  # Fit model
  if (model == "Cox") {
    fit <- survival::coxph(Surv(t, d) ~ x, data = data)
  } else if (model == "RP(2)") {
    fit <- rstpm2::stpm2(Surv(t, d) ~ x, data = data, df = 2)
  } else {
    fit <- eha::phreg(Surv(t, d) ~ x, data = data, dist = "weibull", shape = 1)
  }
  # Return relevant coefficients
  data.frame(
    dataset = unique(data$dataset),
    n = unique(data$n),
    baseline = unique(data$baseline),
    theta = coef(fit)["x"],
    se = sqrt(ifelse(model == "Exp", fit$var["x", "x"], vcov(fit)["x", "x"])),
    model = model,
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}
```

We now run the models for each simulated dataset:

```{r run-models}
results <- list()
results[["n = 50, baseline = Exp, model = Cox"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 50, baseline = Exp"]],
    FUN = fit_models,
    model = "Cox"
  )
)
results[["n = 250, baseline = Exp, model = Cox"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 250, baseline = Exp"]],
    FUN = fit_models,
    model = "Cox"
  )
)
results[["n = 50, baseline = Wei, model = Cox"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 50, baseline = Wei"]],
    FUN = fit_models,
    model = "Cox"
  )
)
results[["n = 250, baseline = Wei, model = Cox"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 250, baseline = Wei"]],
    FUN = fit_models,
    model = "Cox"
  )
)

results[["n = 50, baseline = Exp, model = Exp"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 50, baseline = Exp"]],
    FUN = fit_models,
    model = "Exp"
  )
)
results[["n = 250, baseline = Exp, model = Exp"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 250, baseline = Exp"]],
    FUN = fit_models,
    model = "Exp"
  )
)
results[["n = 50, baseline = Wei, model = Exp"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 50, baseline = Wei"]],
    FUN = fit_models,
    model = "Exp"
  )
)
results[["n = 250, baseline = Wei, model = Exp"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 250, baseline = Wei"]],
    FUN = fit_models,
    model = "Exp"
  )
)

results[["n = 50, baseline = Exp, model = RP(2)"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 50, baseline = Exp"]],
    FUN = fit_models,
    model = "RP(2)"
  )
)
results[["n = 250, baseline = Exp, model = RP(2)"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 250, baseline = Exp"]],
    FUN = fit_models,
    model = "RP(2)"
  )
)
results[["n = 50, baseline = Wei, model = RP(2)"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 50, baseline = Wei"]],
    FUN = fit_models,
    model = "RP(2)"
  )
)
results[["n = 250, baseline = Wei, model = RP(2)"]] <- do.call(
  rbind.data.frame,
  lapply(
    X = data[["n = 250, baseline = Wei"]],
    FUN = fit_models,
    model = "RP(2)"
  )
)
```

## Aggregating results

```{r aggregate-results}
relhaz <- do.call(
  rbind.data.frame,
  results
)
row.names(relhaz) <- NULL
```

We save the final results, that will be included as an example in the R package `rsimsum`.

```{r saving, eval = FALSE}
library(usethis)
usethis::use_data(relhaz, overwrite = TRUE)
```

## Summarising results

Finally, we obtain summary statistics by calling the `simsum` function:

```{r compute-summaries}
library(rsimsum)
s <- rsimsum::simsum(data = relhaz, estvarname = "theta", se = "se", true = -0.50, methodvar = "model", ref = "Cox", by = c("n", "baseline"))
s
```

```{r print-summaries}
summary(s)
```

# Conclusions

With this vignette we showed how to simulate survival data and run a small, simple simulation study.

# References

* Cox D.R. _Regression models and life-tables_. Journal of the Royal Statistical Society, Series B (Methodological), 1972, 34(2):187-220

* Royston P. and Parmar M.K. _Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects_. Statistics in Medicine, 2002, 21(15):2175-2197 

* Bender R., Augustin T., and Blettner M. _Generating survival times to simulate Cox proportional hazards models_. Statistics in Medicine, 2005, 24(11):1713-1723