---
title: "Intermediate rtables - Identifying Required Analysis Behavior"
subtitle: Contributed by Johnson & Johnson Innovative Medicine
date: "2025-10-22"
author:
- Gabriel Becker
- Dan Hofstaedter
output:
    rmarkdown::html_document:
        theme: "spacelab"
        highlight: "kate"
        toc: true
        toc_float: true
        code_folding: show
vignette: >
  %\VignetteIndexEntry{Intermediate rtables - Identifying Required Analysis Behavior}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
      wrap: 72
  chunk_output_type: console
---

```{r, include = FALSE}
suggested_dependent_pkgs <- c("dplyr")
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = all(vapply(
    suggested_dependent_pkgs,
    requireNamespace,
    logical(1),
    quietly = TRUE
  ))
)
```

```{r, echo=FALSE}
knitr::opts_chunk$set(comment = "#")
```


# Introduction

```{r init, echo = FALSE, results = "hidden"}
suppressPackageStartupMessages(library(rtables))
suppressPackageStartupMessages(library(dplyr))

knitr::opts_chunk$set(comment = "")
```

Analysis functions can do anything. While this provides one of the
backbones of `rtables`' power and flexibility, it also means that
analysis functions must be chosen (and designed) with care to ensure
our resulting tables contain the correct statistics.

While the diversity of tables is such that custom analysis functions
will inevitably be needed in some cases, we will not address the
*creation* of analysis functions in detail here; that is covered in
within the [advanced](./advanced_usage.html) portion of
this guided tour. Here we will focus on *selecting* analysis functions
and the reasoning that goes into that choice. While this might seem
like a strange distinction at first glance, this is in fact how most
tables will be created - when a template script or function is not
used to create them from whole cloth; packages like
[tern](https://cran.r-project.org/package=tern) and
[junco](https://cran.r-project.org/package=junco) provide libraries of
well-tested analysis functions that statistical programmers will often
simply select from when creating production outputs.


# Who Are Your Numbers And What Do They Do?

The details of what a desired analysis behavior calls for can be
categorized into two distinct types: specific and
fundamental. Specific analysis behaviors are things like which
particular statistical test or confidence interval method should be
used, how missing or censored data should be handled, etc.

Before specific details can be considered, however, the fundamental
aspects of the desired behavior must be determined. These come down to
three aspects fundamental to how the contents of our table are
intended to be interpreted:

1. The **type** of observation being *counted or analyzed*, 
2. The **(sub)population** we are *analyzing within* or *comparing against*, and
3. Whether the behavior is **conditional**, depending on the facet it is populating.

## There's Counting And Then There's Counting...

Consider the "simple" act of counting the frequency of different types
of adverse events (AEs) that occurred in a clinical trial; this could
mean one of two quite different things:

1. The number of **(unique) patients** which suffered a particular AE, or
2. The number of **times the AE was suffered**

These will generally give dramatically different numbers, as illustrated by the
table below, and one or both may be desired in a given table.


```{r}
double_count <- function(x) {
  in_rows(
    "Unique Patients" = length(unique(x)),
    "Total Events" = length(x)
  )
}

lyt_counts <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("AEBODSYS", split_fun = trim_levels_in_group("AEDECOD")) |>
  split_rows_by("AEDECOD") |>
  analyze("USUBJID", afun = double_count)

build_table(lyt_counts, ex_adae, alt_counts_df = ex_adsl)
```

## Denominators

Once we have defined counting, we often want to display percents
alongside the absolute counts. Even with the meaning of our counts
fixed, however, this can mean a number of different things, depending
on ***the (sub)population we are considering the count a percent
of***. Different denominators used in production can include the
overall relevant count for the:

1. column (often trial arm)
2. individual facet (row x column)
3. row group (across all columns)
4. ancestor row facet (across all columns)
5. ancestor row facet for a single column

Keeping in mind that each of the above might be either of the two
**types** of counting discussed above (generally events or patients in
a clinical trial setting).

Furthermore, these denominators can often need to be derived from the
`alt_counts_df` used to build the table when counting patients, rather
than the data being directly tabulated, due to the fact that not all
patients are guaranteed to appear in event based datasets (e.g.,
`ADAE`).

As before, we can see that these denominators can vary wildly, which
would cause the percents displayed to do so as well. We will use a
custom afun (`disp_denoms`) to explore the differences between the
above types of denominators. The details of how we implement the
`disp_denoms` function used below is not germane to our discussion
here but motivated readers can choose to inspect it below.


```{r, class.source="fold-hide"}
disp_denoms <- function(x, .var, .N_col, .df_row, .alt_df_full, .spl_context) {
  ## myfn <- RefFootnote("Patients observed in AE data", symbol = "*")
  parent_df <- .spl_context$full_parent_df[[3]]
  parent_df_onecol <- parent_df[eval(.spl_context$cur_col_expr[[1]], envir = parent_df), ]
  gparent_df <- .spl_context$full_parent_df[[2]]
  gparent_df_onecol <- gparent_df[eval(.spl_context$cur_col_expr[[1]], envir = gparent_df), ]
  cur_race <- .spl_context$value[2]
  cur_aebodsys <- .spl_context$value[3]
  cur_aedecod <- .spl_context$value[4]
  alt_df_race <- subset(.alt_df_full, RACE == cur_race)
  alt_df_race_col <- alt_df_race[eval(.spl_context$cur_col_expr[[1]], envir = alt_df_race), ]

  vals <- list(
    .N_col,
    length(x),
    length(unique(x)),
    NA,
    NROW(.df_row),
    length(unique(.df_row[[.var]])),
    NA,
    NROW(parent_df),
    length(unique(parent_df[[.var]])),
    NROW(parent_df_onecol),
    length(unique(parent_df_onecol[[.var]])),
    NROW(gparent_df),
    length(unique(gparent_df[[.var]])),
    NROW(alt_df_race),
    NROW(gparent_df_onecol),
    length(unique(gparent_df_onecol[[.var]])),
    NROW(alt_df_race_col)
  )
  names(vals) <- c(
    "Column N",
    "facet events",
    "facet patients",
    "facet patients - alt df",
    sprintf("%s events", cur_aedecod),
    sprintf("%s patients", cur_aedecod),
    sprintf("%s patients - alt df", cur_aedecod),
    sprintf("%s events (all arms)", cur_aebodsys),
    sprintf("%s patients (all arms)", cur_aebodsys),
    sprintf("%s events (this arm)", cur_aebodsys),
    sprintf("%s patients (this arm)", cur_aebodsys),
    sprintf("%s events (all arms)", cur_race),
    sprintf("%s patients (all arms)", cur_race),
    sprintf("%s patients (all arms) - alt df", cur_race),
    sprintf("%s events (this arms)", cur_race),
    sprintf("%s patients (this arms)", cur_race),
    sprintf("%s patients (this arms) - alt df", cur_race)
  )

  in_rows(.list = vals)
}
```

With `disp_denoms` defined, we can proceed with using it to see how
varied the options for denominator are when calculating a percent:

```{r}
map <- tribble(
  ~RACE, ~AEBODSYS, ~AEDECOD,
  "ASIAN", "cl B.2", "dcd B.2.2.3.1",
  "WHITE", "cl A.1", "dcd A.1.1.1.1"
)

lyt_denoms <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("RACE", split_fun = trim_levels_to_map(map)) |> ## keep_split_levels(c("ASIAN", "WHITE"))) |>
  split_rows_by("AEBODSYS") |> # , split_fun = trim_levels_in_group("AEDECOD")) |>
  split_rows_by("AEDECOD") |>
  analyze("USUBJID", afun = disp_denoms)

build_table(lyt_denoms, ex_adae, alt_counts_df = ex_adsl)
```

The above table displays the various denominators we might use across
both types of counting discussed in the previous section. Each of
these denominator choices would result in at least one difference in
calculated percent from all other choices across the full table
(barring any coincidental equality between count types in your
data). Thus when programming against a table shell with the ubiquitous
"xx (xx.x%)" family of formats, we must **carefully choose an afun
which calculates the *correct type of percent***.

Many existing analysis functions provided by production libraries such
as `tern` and `junco` support different denominator options for use
when calculating percents out of the box, though they do not all
support the full breadth of variety explored above. Nonetheless,
careful consideration of the existing analysis functions available to
you and what they support should be made before choosing to create a
custom one.

# "Yeah But What If" ... Conditionality In `afun`s

While many tables call for analysis functions which perform identical
calculations across all facets they are applied within, this is
**not** a requirement of the `rtables` framework. In this vignette we
will not discuss the *creation* of analysis functions which support
creating facet-conditional cell values; rather we will discuss
*identifying* when such analysis functions are required, leaving the
implementation of `afun`s with complex behavior to advanced portion(s)
of this tour.

## Conditionality Based On Column Facet

In point of fact, we saw conditionality on the current column in the
[previous vignette](./guided_intermediate_translating_shells.html)
when translating shells with risk difference columns; there the dummy
`afun` we used calculated count-percent values for facets in the "main
portion" of the table and displayed a custom string for facets in the
risk difference portion. In real tables that string would be replaced
with actual risk difference calculations, of course, but the
conditionality itself is the same.

Generally speaking, `afun`s we use will need this type of column-facet
conditionality any time we have a *heterogeneous column structure*
(i.e., any time we have non-nested column faceting in our layout). In
these cases analysts will need to use `afuns` specifically designed to
support the specific type of column structure their layout declares.

There is not currently a production-ready way to combine `afun`s into
a single function and declare the conditions under which each should
be used, though we we will explore this as an exercise in the advanced
portion of this tour.

## Conditionality Based On Row Facet

Unlike column-conditional `afun`s, purely row-conditional `afun`s do
not generate rows which have values with different meanings for
different cells in the row. Rather, row-conditionality manifests as
different row facets containing different numbers of rows and/or
entire rows or sets of rows with different types of content based on
the current row facet.

A common example of a shell with different numbers of rows depending
on the row facet is one summarizing patient visits (`AVISIT`). In such
tables, the row facet for the initial or `BASELINE` visit will often
have only a subset of those in the other facets, as those will contain
comparisons to the baseline visit.

```{r, class.source = "fold-hide"}
avisit_afun <- function(df, .var, .spl_context) {
  cur_visit <- tail(.spl_context$value, 1)
  vals <- list("Mean Patient DIABP" = mean(df[[.var]]))
  pardf <- head(.spl_context$full_parent_df, 1)[[1]]

  if (!(as.character(cur_visit) %in% c("SCREENING", "BASELINE"))) {
    pardf <- subset(pardf, AVISIT %in% c("BASELINE", cur_visit))
    difs <- tapply(
      seq_len(nrow(pardf)), pardf$USUBJID,
      function(iis) {
        avalvec <- pardf$AVAL[iis]
        bl <- which(as.character(pardf[iis, ]$AVISIT) == "BASELINE")
        mean(avalvec[-bl] - avalvec[bl])
      }
    )
    vals <- c(vals, list("Mean Diff From Patient's Baseline DIABP" = mean(difs)))
  }

  in_rows(.list = vals)
}
```

We can see this behavior using a basic row-conditional `afun` (defined above for those curious):
```{r}
lyt_rowcond <- basic_table() |>
  split_rows_by("AVISIT") |>
  analyze("AVAL", avisit_afun, format = "xx.xx")

build_table(lyt_rowcond, subset(ex_advs, PARAMCD == "DIABP"))
```

In the above table, we display only the mean of the measurement for
that visit (simulated `DIABP` in this case) for the `SCREENING` and
`BASELINE` visits, while we additionally display the mean of the
per-patient difference between their values for that visit and their
baseline visit for follow-up visits.

## Why Not Both?

It is important to note that while we described row- and column-facet
conditionality separately in this section, an `afun` can be both
simultaneously. In fact an `afun`'s column conditionality can itself
be dependent on which row facet it is in.

The most straightforward example of this type of "dual conditionality"
is when we have heterogeneous column structure, but the additional
columns should only be populated for some portions of the table and
left blank for others.

For this example we repeat the same data preparation as in the Risk
Difference Columns section [in the previous
chapter](./guided_intermediate_translating_shells.html) without
further comment.

```{r}
advs <- ex_advs
advs$span_label <- "Active Treatment"
advs$span_label[advs$ARM == "B: Placebo"] <- " "

span_label_map <- tribble(
  ~span_label, ~ARM,
  "Active Treatment", "A: Drug X",
  "Active Treatment", "C: Combination",
  " ", "B: Placebo",
)

advs$rr_header <- "Risk Differences"
advs$rr_label <- paste(substr(advs$ARM, 1, 1), "vs B")
```

We can then define an afun which populates the "risk difference" columns only for follow up visits and leaves them blank otherwise:

```{r, class.source = "fold-hide"}
in_risk_diff <- function(spl_context) grepl("Risk Differences", spl_context$cur_col_id[1])

avisit_afun2 <- function(df, .var, .spl_context) {
  in_rd <- in_risk_diff(.spl_context)

  cur_visit <- tail(.spl_context$value, 1)
  is_followup <- !(as.character(cur_visit) %in% c("SCREENING", "BASELINE"))
  if (!in_rd) {
    vals <- list("Mean Patient DIABP" = mean(df[[.var]]))
  } else if (!is_followup) {
    vals <- list("Mean Patient DIABP" = NULL)
  } else {
    vals <- list("Mean Patient DIABP" = rcell("-", format = "xx"))
  }
  pardf <- head(.spl_context$full_parent_df, 1)[[1]]

  if (is_followup) {
    if (!in_rd) {
      pardf <- subset(pardf, AVISIT %in% c("BASELINE", cur_visit))
      difs <- tapply(
        seq_len(nrow(pardf)), pardf$USUBJID,
        function(iis) {
          avalvec <- pardf$AVAL[iis]
          bl <- which(as.character(pardf[iis, ]$AVISIT) == "BASELINE")
          mean(avalvec[-bl] - avalvec[bl])
        }
      )
      vals <- c(vals, list("Mean Diff From Baseline" = mean(difs)))
    } else {
      armlabel <- tail(.spl_context$cur_col_split_val[[1]], 1) # last split value, ie arm
      armletter <- substr(armlabel, 1, 1)
      vals <- c(vals, list("Mean Diff From Baseline" = rcell(paste(armletter, "vs B"), format = "xx")))
    }
  }

  in_rows(.list = vals)
}
```


With that done, we can construct a table that has dual conditionality
in its cell contents.

```{r}
lyt_fullcond <- basic_table() |>
  split_cols_by("span_label", split_fun = trim_levels_to_map(span_label_map)) |>
  split_cols_by("ARM", show_colcounts = TRUE) |>
  split_cols_by("rr_header", nested = FALSE) |>
  split_cols_by("ARM", labels_var = "rr_label", split_fun = remove_split_levels("B: Placebo")) |>
  split_rows_by("AVISIT") |>
  analyze("AVAL", avisit_afun2, format = "xx.xx")


build_table(lyt_fullcond, subset(advs, PARAMCD == "DIABP"))
```

# A Final Note On `afun` Complexity

Many production `afun`s, such as those provided by `tern` or `junco`,
support multiple behaviors that are controlled user specified
arguments (passed via `extra_args`). This can be used to control
denominator types, choice of statistical methods, and anything else
the developers wish. This is a separate type of complexity from
behavior that is conditional on column or row structure -- and in fact
can be combined with such -- but it is nonetheless important to
consider when choosing which `afun` to use.
