---
title: "Getting Started with the ipd Package"
output: 
  BiocStyle::html_document:
    keep_md: true
vignette: >
  %\VignetteIndexEntry{Getting Started with the ipd Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{tidyverse}
  %\VignetteDepends{patchwork}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    error    = FALSE,
    warning  = FALSE,
    message  = FALSE,
    comment  = "#>"
)
```

```{r, echo=FALSE}
default_options <- options()
```

# Introduction

## Background

With the rapid advancement of artificial intelligence and machine learning 
(AI/ML), researchers from a wide range of disciplines increasingly use 
predictions from pre-trained algorithms as outcome variables in statistical 
analyses. However, reifying algorithmically-derived values as measured outcomes 
may lead to biased estimates and anti-conservative inference 
([Salerno et al., 2025](https://doi.org/10.48550/arXiv.2512.05456)). The 
statistical challenges encountered when drawing inference on predicted data 
(IPD) include: 

1. Understanding the relationship between predicted outcomes and their true, 
unobserved counterparts
2. Quantifying the robustness of the AI/ML models to resampling or uncertainty 
about the training data
3. Appropriately propagating both bias and uncertainty from predictions into 
downstream inferential tasks

Several works have proposed methods for IPD, including post-prediction 
inference (PostPI) by 
[Wang et al., 2020](https://doi.org/10.1073/pnas.2001238117), 
prediction-powered inference (PPI) and PPI++ by 
[Angelopoulos et al., 2023a](https://doi.org/10.1126/science.adi6000) and 
[Angelopoulos et al., 2023b](https://doi.org/10.48550/arXiv.2311.01453), 
post-prediction adaptive inference (PSPA) by 
[Miao et al., 2023](https://doi.org/10.48550/arXiv.2311.14220), and a 
correction based on the Chen and Chen method and alternate PPI "All" by 
[Gronsbell et al., 2025](https://doi.org/10.48550/arXiv.2411.19908). To enable 
researchers and practitioners interested in these state-of-the-art methods, we 
have developed `ipd`, a open-source `R` package that implements these methods 
under the umbrella of IPD.

This vignette provides a guide to using the `ipd` package, including 
installation instructions, examples of data generation, model fitting, and 
usage of custom methods. The examples demonstrate the package's functionality.

## Notation

Following the notation of 
[Miao et al., 2023](https://doi.org/10.48550/arXiv.2311.14220), we 
assume we have the following data structure:

1. Labeled data $L=\{Y^L,X^L,f(X^L)\}$; unlabeled data $U=\{X^U,f(X^U)\}$.
2. The labeled set is typically smaller in size compared to the unlabeled set. 
3. We have access to an algorithm $f(X)$ that can predict our outcome, $Y$.
4. Our interest is in performing inference on a quantity such as the outcome 
mean or quantile, or to recover a downstream inferential (mean) model:

$$\mathbb{E}\left[Y^{\mathcal{U}} \mid \boldsymbol{X}^{\mathcal{U}}\right] = 
g^{-1}\left(\boldsymbol{X}^{\mathcal{U}'}\beta\right),$$

where $\beta$ is a vector of regression coefficients and $g(\cdot)$ is a 
given link function, such as the identity link for linear regression, the 
logistic link for logistic regression, or the log link for Poisson regression. 
However, in practice, we do not observe $Y^\mathcal{U}$ in the 'unlabeled' 
subset of the data. Instead, these values are replaced by the predicted 
$f(X^\mathcal{U})$. We can use methods for IPD to obtain corrected estimates 
and standard errors when we replace these unobserved $Y^\mathcal{U}$ by 
$f(X^\mathcal{U})$.

## Installation

To install the `ipd` package from [Bioconductor](https://www.bioconductor.org/), you can use the `BiocManager` package:

```{r eval = FALSE}
#-- Install BiocManager if it is not already installed

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(version = "3.21")

#-- Install the ipd package from Bioconductor

BiocManager::install("ipd")
```

Or, to install the development version of `ipd` from 
[GitHub](https://github.com/ipd-tools/ipd), you can use the `devtools` package:

```{r, eval=FALSE}
#-- Install devtools if it is not already installed

install.packages("devtools")

#-- Install the ipd package from GitHub

devtools::install_github("ipd-tools/ipd")
```

# Usage

We provide a simple example to demonstrate the basic use of the functions 
included in the `ipd` package.

```{r setup}
#-- Load necessary libraries

library(ipd)
library(tidyverse)
library(patchwork)
```

## Data Generation

The `ipd` packages provides a unified function, `simdat`, for generating 
synthetic datasets for various models. The function currently supports "mean", 
"quantile", "ols", "logistic", and "poisson" models. 

### Function Arguments

- `n`: A vector of size 3 indicating the sample size in the training, labeled, 
and unlabeled data sets.
- `effect`: A float specifying the regression coefficient for the first 
variable of interest (defaults to 1).
- `sigma_Y`: A float specifying the residual variance for the generated outcome.
- `model`: The type of model to be generated. Must be one of `"mean"`, 
`"quantile"`, `"ols"`, `"logistic"`, or `"poisson"`.

The `simdat` function generate a data.frame with three subsets: (1) an 
independent "training" set with additional observations used to fit a 
prediction model, and "labeled" and "unlabeled" sets which contain the 
observed and predicted outcomes and the simulated features of interest. 

### Generating Data for Linear Regression

We can generate a continuous outcome and relevant predictors for linear 
regression as follows. The `simdat` function generates four independent 
covariates, $X_1$, $X_2$, $X_3$, and $X_4$, and the outcome:

$$Y = \text{effect}\times X_1 + \frac{1}{2}\times X_2^2 + 
\frac{1}{3}\times X_3^3 + \frac{1}{4}\times X_4^2 + \varepsilon_y$$

where `effect` is one of the function arguments and 
$\varepsilon_y \sim N(0, \text{sigma_Y})$, with `sigma_Y` being another 
argument. Here, the `simdat` function generates three subsets of data, a 
"training" subset, a "labeled" subset, and an "unlabeled" subset, based on 
the sizes in `n`. It then learns the prediction rule for the outcome in the 
"training" subset using a generalized additive model and predicts these 
outcomes in the "labeled" and "unlabeled" subsets:

```{r simols}}
#-- Generate a Dataset for Linear Regression

set.seed(123)

n <- c(10000, 500, 1000)

dat_ols <- simdat(n = n, effect = 1, sigma_Y = 4, model = "ols",
                  
    shift = 1, scale = 2)

#-- Print First 6 Rows of Training, Labeled, and Unlabeled Subsets

options(digits = 2)

head(dat_ols[dat_ols$set_label == "training", ])

head(dat_ols[dat_ols$set_label == "labeled", ])

head(dat_ols[dat_ols$set_label == "unlabeled", ])
```

The `simdat` function provides observed and unobserved outcomes for both the 
labeled and unlabeled datasets, though in practice the observed outcomes are 
not in the unlabeled set. We can visualize the relationships between these 
variables in the labeled data subset:

```{r plot, echo = FALSE, error = FALSE, warning = FALSE, message = FALSE, comment = NA, fig.height=3, dpi=900}
#-- Plot example labeled data

dat_labeled <- dat_ols[dat_ols$set_label == "labeled", ]

my_theme <- theme_bw() +

    theme(
        
        axis.title = element_text(size = 8),
            
        axis.text = element_text(size = 8),
            
        title = element_text(size = 8))

p1 <- ggplot(dat_labeled, aes(x = X1, y = Y)) + my_theme +
  
    coord_fixed(1 / 3) + geom_abline(slope = 1, intercept = 0) +

    geom_point(alpha = 0.5) + geom_smooth(method = "lm") +

    scale_x_continuous(limits = c(-2.5, 2.5)) +
      
    scale_y_continuous(limits = c(-7.5, 7.5)) +
      
    labs(x = "\nCovariate", y = "Observed Outcome\n")

p2 <- ggplot(dat_labeled, aes(x = X1, y = f)) +
  
    my_theme + coord_fixed(1 / 3) + geom_abline(slope = 1, intercept = 0) +
      
    geom_point(alpha = 0.5) + geom_smooth(method = "lm") +
      
    scale_x_continuous(limits = c(-2.5, 2.5)) +
      
    scale_y_continuous(limits = c(-7.5, 7.5)) +
      
    labs(x = "\nCovariate", y = "Predicted Outcome\n") 

p3 <- ggplot(dat_labeled, aes(x = f, y = Y)) +
    
    my_theme + coord_fixed(2 / 3) + geom_abline(slope = 1, intercept = 0) +
    
    geom_point(alpha = 0.5) + geom_smooth(method = "lm") +
    
    scale_x_continuous(limits = c(-5.0, 5.0)) +
    
    scale_y_continuous(limits = c(-7.5, 7.5)) +
    
    labs(x = "\nPredicted Outcome", y = "Observed Outcome\n")

fig1 <- (p1 + theme(plot.margin = unit(c(0, 0, 0, 0), "pt"))) +

    (p2 + theme(plot.margin = unit(c(0, 10, 0, 10), "pt"))) +
  
    (p3 + theme(plot.margin = unit(c(0, 10, 0, 0), "pt"))) +
  
    plot_annotation(tag_levels = "A")

fig1
```

We can see that:

-   The predicted outcomes are more correlated with the covariate than the true outcomes (panels A and B).
-   The predicted outcomes are not perfect substitutes for the true outcomes (panel C).

### Generating Data for Logistic Regression

As another example, we can generate a binary outcome and relevant predictors 
for logistic regression as follows:

```{r simlogistic}}
#-- Generate a Dataset for Logistic Regression

set.seed(123)

dat_logistic <- simdat(n = n, effect = 3, sigma_Y = 1, model = "logistic")

#-- Print First 6 Rows of Training, Labeled, and Unlabeled Subsets

head(dat_logistic[dat_logistic$set_label == "training", ])

head(dat_logistic[dat_logistic$set_label == "labeled", ])

head(dat_logistic[dat_logistic$set_label == "unlabeled", ])
```

```{r logsum, echo=FALSE}
dat_logistic_labeled <- dat_logistic[dat_logistic$set_label == "labeled", ]

dat_logistic_labeled_summ <- dat_logistic_labeled |>

    group_by(Y, f) |>

    count() |>

    ungroup() |>

    mutate(

        Y = factor(Y), f = factor(f),
        pct = n / sum(n) * 100,
        fill = if_else(Y == f, 1, 0))
```

We can again visualize the relationships between the true and predicted 
outcome variables in the labeled data subset and see that 
`r sprintf("%2.1f%%", sum(with(dat_logistic_labeled_summ, pct[fill == 1])))`
observations are correctly predicted:

```{r plot2, echo=FALSE, fig.width=7, dpi=900}
dat_logistic_labeled_summ |>

    ggplot(aes(x = f, y = Y, fill = fill)) +

        geom_tile() +

        coord_equal() +

        geom_text(aes(label = paste0(n, " (", sprintf("%2.1f", pct), "%)")),

            vjust = 1) +

        scale_x_discrete(expand = c(0, 0), limits = rev) +

        scale_y_discrete(expand = c(0, 0)) +

        scale_fill_gradient(high = "steelblue", low = "white") +
  
        labs(x = "\nPredicted Outcome", y = "Observed Outcome\n") +
  
        theme(legend.position = "none")
```

## Model Fitting

### Linear Regression

We compare two non-IPD approaches to analyzing the data to methods included in the `ipd` package. The two non-IPD benchmarks are the 'naive' method and the 'classic' method. The 'naive' treats the predicted outcomes as if they were observed and regresses the predictions on the covariates of interest without calibration. The 'classic' uses only the subset of labeled observations where we observe the true outcome. The IPD methods are listed in alphabetical order by method name. 

#### 'Naive' Regression Using the Predicted Outcomes

```{r naive}
#--- Fit the Naive Regression

lm(f ~ X1, data = dat_ols[dat_ols$set_label == "unlabeled", ]) |>

    summary()
```

#### 'Classic' Regression Using only the Labeled Data

```{r classic}
#--- Fit the Classic Regression

lm(Y ~ X1, data = dat_ols[dat_ols$set_label == "labeled", ]) |>

    summary()
```

You can fit the various IPD methods to your data and obtain summaries using the 
provided wrapper function, `ipd()`:

#### Chen and Chen Correction (Gronsbell et al., 2025)

```{r chen_ols}

#-- Specify the Formula

formula <- Y - f ~ X1

#-- Fit the Chen and Chen Correction

ipd::ipd(formula, method = "chen", model = "ols", 

    data = dat_ols, label = "set_label") |>

    summary()

```

#### Prediction Decorrelated Inference (Gan et al., 2024)

```{r pdc_ols}

#-- Fit the PDC Correction

ipd::ipd(formula, method = "pdc", model = "ols", 

    data = dat_ols, label = "set_label") |>

    summary()

```

#### PostPI Bootstrap Correction (Wang et al., 2020)

```{r postpi_boot_ols}

#-- Fit the PostPI Bootstrap Correction

nboot <- 200

ipd::ipd(formula, method = "postpi_boot", model = "ols", 

    data = dat_ols, label = "set_label", nboot = nboot) |>

    summary()
```

#### PostPI Analytic Correction (Wang et al., 2020)

```{r postpi_analytic_ols}
#-- Fit the PostPI Analytic Correction

ipd::ipd(formula, method = "postpi_analytic", model = "ols", 

    data = dat_ols, label = "set_label") |>

    summary()
```

#### Prediction-Powered Inference (PPI; Angelopoulos et al., 2023)

```{r ppi_ols}
#-- Fit the PPI Correction

ipd::ipd(formula, method = "ppi", model = "ols", 

    data = dat_ols, label = "set_label") |>

    summary()
```

#### PPI "All" (Gronsbell et al., 2025)


```{r ppi_a_ols}
#-- Fit the PPI Correction

ipd::ipd(formula, method = "ppi_a", model = "ols", 

    data = dat_ols, label = "set_label") |>

    summary()
```


#### PPI++ (Angelopoulos et al., 2023)

```{r ppi_plusplus}
#-- Fit the PPI++ Correction

ipd::ipd(formula, method = "ppi_plusplus", model = "ols", 

    data = dat_ols, label = "set_label") |>

    summary()
```

#### Post-Prediction Adaptive Inference (PSPA; Miao et al., 2023)

```{r pspa}
#-- Fit the PSPA Correction

ipd::ipd(formula, method = "pspa", model = "ols", 

    data = dat_ols, label = "set_label") |>

    summary()
```

### Logistic Regression

We also show how these methods compare for logistic regression.

#### 'Naive' Regression Using the Predicted Outcomes

```{r naive2}
#--- Fit the Naive Regression

glm(f ~ X1, family = binomial,

    data = dat_logistic[dat_logistic$set_label == "unlabeled", ]) |>

    summary()
```

#### 'Classic' Regression Using only the Labeled Data

```{r classic2}
#--- Fit the Classic Regression

glm(Y ~ X1, family = binomial, 

    data = dat_logistic[dat_logistic$set_label == "labeled", ]) |>

    summary()
```

You can again fit the various IPD methods to your data and obtain summaries 
using the provided wrapper function, `ipd()`:

#### PostPI Bootstrap Correction (Wang et al., 2020)

```{r postpi_boot2}
#-- Specify the Formula

formula <- Y - f ~ X1

#-- Fit the PostPI Bootstrap Correction

nboot <- 200

ipd::ipd(formula, method = "postpi_boot", model = "logistic",

    data = dat_logistic, label = "set_label", nboot = nboot) |>

    summary()
```

#### Prediction-Powered Inference (PPI; Angelopoulos et al., 2023)

```{r ppi2}
#-- Fit the PPI Correction

ipd::ipd(formula, method = "ppi", model = "logistic",

    data = dat_logistic, label = "set_label") |>

    summary()
```

#### PPI++ (Angelopoulos et al., 2023)

```{r ppi_plusplus2}
#-- Fit the PPI++ Correction

ipd::ipd(formula, method = "ppi_plusplus", model = "logistic",

    data = dat_logistic, label = "set_label") |>

    summary()
```

#### Post-Prediction Adaptive Inference (PSPA; Miao et al., 2023)

```{r pspa2}
#-- Fit the PSPA Correction

ipd::ipd(formula, method = "pspa", model = "logistic", 

    data = dat_logistic, label = "set_label") |>

    summary()
```

## Printing, Summarizing, and Tidying

The package also provides custom `print`, `summary`, `tidy`, `glance`, and 
`augment` methods to facilitate easy model inspection:

```{r methods}
#-- Fit the PostPI Bootstrap Correction

nboot <- 200

fit_postpi <- ipd::ipd(formula, method = "postpi_boot", model = "ols",

    data = dat_ols, label = "set_label", nboot = nboot)
```

### Print Method

The `print` method gives an abbreviated summary of the output from the `ipd` 
function:

```{r print}
#-- Print the Model

print(fit_postpi)
```

### Summary Method

The `summary` method gives more detailed information about the estimated 
coefficients, standard errors, and confidence limits:

```{r summary}
#-- Summarize the Model

summ_fit_postpi <- summary(fit_postpi)

#-- Print the Model Summary

print(summ_fit_postpi)
```

### Tidy Method

The `tidy` method organizes the model coefficients into a 
[tidy](https://broom.tidymodels.org/) format.

```{r tidy}
#-- Tidy the Model Output

tidy(fit_postpi)
```

### Glance Method

The `glance` method returns a one-row summary of the model fit.

```{r glance}
#-- Get a One-Row Summary of the Model

glance(fit_postpi)
```

### Augment Method

The `augment` method adds model predictions and residuals to the original 
dataset.

```{r augment}
#-- Augment the Original Data with Fitted Values and Residuals

augmented_df <- augment(fit_postpi)

head(augmented_df)
```

```{r, echo=FALSE}
options(default_options)
```

# Conclusions

The `ipd` package offers a suite of functions for conducting inference on 
predicted data. With custom methods for printing, summarizing, tidying, 
glancing, and augmenting model outputs, `ipd` streamlines the process of 
IPD-based inference in `R`. We will continue to develop this package to 
include more targets of inference and IPD methods as they are developed, 
as well as additional functionality for analyzing such data. For further 
information and detailed documentation, please refer to the function help 
pages within the package, e.g.,

```{r help, eval=FALSE}
?ipd
```

## Feedback

For questions, comments, or any other feedback, please contact the 
developers (<ssalerno@fredhutch.org>).

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on 
[GitHub](https://github.com/ipd-tools/ipd).

## License

This package is licensed under the MIT License.

## Session Info

```{r sessionInfo}
sessionInfo()
```

