---
title: "WFDB: An Introduction to the Waveform Database Software Package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{WFDB: An Introduction to the Waveform Database Software Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(EGM)
```

Historically `{EGM}` wrapped the [Waveform Database (WFDB) software package](https://physionet.org/content/wfdb/10.7.0/), a well documented suite of `C` and `C++` applications for interacting with signal data.
Recent releases of `{EGM}` now provide native readers, writers, and annotation utilities implemented directly in `R` and `C++`, so the core functionality of the package no longer depends on an external WFDB installation.
You can work with WFDB headers, signals, and annotations on any system where `{EGM}` is installed, including CRAN environments.

Because `{EGM}` ships its own WFDB readers and writers, no additional system dependencies are required to begin analysing WFDB data in `R`.
Installing `{EGM}` from CRAN (or the development version from GitHub) is sufficient for the examples in this guide to run end-to-end on any platform.

If you would still like to interface with the traditional `C`-based WFDB tools—for example to compare results or to call PhysioNet utilities that have not yet been ported—you can follow the instructions in the [WFDB GitHub repository](https://github.com/bemoody/wfdb).
The `C`-based WFDB is more robust and has a full set of tools and functionality not found in the `R` version, however the key features of reading and writing data and annotators is compatible with the `R` implementation used by `{EGM}`.

# Signal Data 

### Understanding the WFDB Storage Format

WFDB files store signal data as __integer values__ (called __digital__ units) in binary `.dat` files.
These integers represent the raw output from an analog-to-digital converter (ADC).
To convert these integer values to meaningful __physical__ units (like millivolts), the WFDB header file (`.hea`) contains conversion metadata:

- __`gain`__: ADC units per physical unit (e.g., 200 ADC units per mV)
- __`baseline`__: The ADC value corresponding to 0 physical units
- __`units`__: The physical unit name (e.g., "mV")

The conversion formula between digital and physical units is:

$$
\text{physical} = \frac{\text{digital} - \text{baseline}}{\text{gain}}
$$

And the inverse formula for writing:

$$
\text{digital} = (\text{physical} \times \text{gain}) + \text{baseline}
$$

### Reading Data in Different Units

The `read_wfdb()` and `read_signal()` functions support a `units` parameter that controls how signal data is returned:

- `units = "digital"` (default): Returns raw ADC counts as stored in the file
- `units = "physical"`: Returns values converted to physical units (e.g., mV)

Let's demonstrate this concept using a WFDB file:

```{r, eval=TRUE}
# Locate the WFDB files in the package
dir_path <- system.file('extdata', package = 'EGM')

# Read the same data in both digital and physical units
ecg_digital <- read_wfdb(
  record = "muse-sinus",
  record_dir = dir_path,
  units = "digital"
)

ecg_physical <- read_wfdb(
  record = "muse-sinus",
  record_dir = dir_path,
  units = "physical"
)

# Examine the header to see the conversion parameters
head(ecg_digital$header)
#>   file_name storage_format ADC_gain ADC_baseline ADC_units ADC_resolution
#> 1  muse-sinus.dat      16      200            0        mV             16
#> 2  muse-sinus.dat      16      200            0        mV             16
#> 3  muse-sinus.dat      16      200            0        mV             16
#> 4  muse-sinus.dat      16      200            0        mV             16
#> 5  muse-sinus.dat      16      200            0        mV             16
#> 6  muse-sinus.dat      16      200            0        mV             16
```

Notice the `ADC_gain` and `ADC_baseline` columns in the header.
In this example, all channels have a gain of 200 ADC units/mV and a baseline of 0.
However, many real-world recordings have non-zero baseline values to account for DC offsets in the recording equipment.

## Comparing Digital and Physical Values

Let's compare the first few samples from lead II in both formats:

```{r, eval=TRUE}
# Extract the first 10 samples from lead II
digital_values <- ecg_digital$signal$II[1:10]
physical_values <- ecg_physical$signal$II[1:10]

# Get the conversion parameters for lead II from the header
lead_ii_row <- ecg_digital$header[ecg_digital$header$label == "II", ]
gain <- lead_ii_row$ADC_gain
baseline <- lead_ii_row$ADC_baseline

# Create a comparison table
comparison <- data.frame(
  sample = 1:10,
  digital = digital_values,
  physical = round(physical_values, 3),
  calculated_physical = round((digital_values - baseline) / gain, 3)
)

print(comparison)
#>    Sample Digital Physical Calculated
#> 1       1       5    0.025      0.025
#> 2       2       5    0.025      0.025
#> 3       3       5    0.025      0.025
#> 4       4      10    0.050      0.050
#> 5       5      10    0.050      0.050
#> 6       6      10    0.050      0.050
#> 7       7      10    0.050      0.050
#> 8       8      15    0.075      0.075
#> 9       9      20    0.100      0.100
#> 10     10      24    0.120      0.120

# Verify the conversion formula
cat("\nLead II conversion parameters:\n")
cat("  Gain:", gain, "ADC units/mV\n")
cat("  Baseline:", baseline, "ADC units\n")
#> Lead II conversion parameters:
#>   Gain: 200 ADC units/mV
#>   Baseline: 0 ADC units
```

The calculated physical units are derived from the digital units using the ADC conversion, and thus maintains the fidelity of the original signal.

# Annotations

The WFDB software package utilizes a binary format to store annotations.
Annotations are essentially markers or qualifiers of specific components of a signal, specifying both the specific time or position in the plot, and what channel the annotation refers to.
Annotations are polymorphic, and multiple can be applied to a single signal dataset.
The credit for this work goes directly to the original software creators, as this is just a wrapper to allow for flexible integration into `R`.
All of the demonstrations below rely on the native `{EGM}` parsers, so they run identically whether or not external WFDB binaries are present.

To begin, let's take an example of a simple ECG dataset.
This data is included in the package, and can be accessed as below.

```{r}
fp <- system.file('extdata', 'muse-sinus.xml', package = 'EGM')
ecg <- read_muse(fp)
fig <- ggm(ecg) + theme_egm_light()
fig
```

We may have customized software, manual approaches, or machine learning models that may label signal data.
We can use the `annotation_table()` function to create `WFDB`-compatible annotations, updating both the `egm` object and the written file.
To trial this, let's label the peaks of the QRS complex from this 12-lead ECG.

1. Create a quick, non-robust function for labeling QRS complex peaks. The function `pracma::findpeaks()` is quite good, but to avoid dependencies we are writing our own.
1. Evaluate the fit of the peaks to the dataset
1. Place the annotations into a table, updating the `egm` object
1. Plot the results

```{r}
# Let x = 10-second signal dataset
# We will apply this across the dataset
# This is an oversimplified approach.
find_peaks <- function(x,
                       threshold = 
                         mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE)
                       ) {
  
  # Ensure signal is "positive" for peak finding algorithm
  x <- abs(x)
  
  # Find the peaks
  peaks <- which(diff(sign(diff(x))) == -2) + 1
  
  # Filter the peaks
  peaks <- peaks[x[peaks] > threshold]
  
  # Return
  peaks
}

# Create a signal dataset
dat <- extract_signal(ecg)

# Find the peaks
sig <- dat[["I"]]
pk_loc <- find_peaks(sig)
pk_val <- sig[pk_loc]
pks <- data.frame(x = pk_loc, y = pk_val)

# Plot them
plot(sig, type = "l")
points(x = pks$x, y = pks$y, col = "orange")
```

The result is *not bad* for a simple peak finder, but it lets us generate a small dataset of annotations that can then be used.
Please see the additional vignettes for advanced annotation options, such as multichannel plots, and multichannel annotations.
We can take a look under the hood at the annotation positions we generated.
The relevant arguments, which are displayed below, include:

- annotator: name of annotation function or creator
- time: constructed from sample number and frequency
- sample: integer index of positions
- type: a single character describing the type
- subtype: a single character describing the type
- channel: which channel the data is mapped to
- number: an additional qualifier of the annotation type

```{r}
# Find the peaks
raw_signal <- dat[["I"]]
peak_positions <- find_peaks(raw_signal)
peak_positions

# Annotations do not need to store the value at that time point however
# The annotation table function has the following arguments
args(annotation_table)

# We can fill this in as below using additional data from the original ECG
hea <- ecg$header
start <- attributes(hea)$record_line$start_time
hz <- attributes(hea)$record_line$frequency

ann <- annotation_table(
  annotator = "our_pks",
  sample = peak_positions,
  type = "R",
  frequency = hz,
  channel = "I"
)

# Here are our annotations
ann

# Then, add this back to the original signal
ecg$annotation <- ann
ecg
```

