---
title: "Introduction to betaregscale"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to betaregscale}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 5,
  fig.align = "center",
  warning = FALSE,
  message = FALSE
)
options(digits = 4)

kbl10 <- function(x, n = 10, digits = 4, ...) {
  knitr::kable(utils::head(as.data.frame(x), n), digits = digits, align = "c", ...)
}
```

## Overview

The **betaregscale** package provides maximum-likelihood estimation of beta
regression models for responses derived from bounded rating scales.
Common examples include pain intensity scales (NRS-11, NRS-21, NRS-101),
Likert-type scales, product quality ratings, and any instrument whose
response can be mapped to the open interval $(0,1)$.



The key idea is that a discrete score recorded on a bounded scale carries
measurement uncertainty inherent to the instrument. For instance, a pain
score of $y=6$ on a 0–10 NRS is not an exact value but rather
represents a range: after rescaling to $(0,1)$, the observation is
treated as interval-censored in $[0.55,0.65]$. The package uses the
beta distribution to model such data, building a complete likelihood that
supports mixed censoring types within the same dataset.

## Installation

```{r install, eval = FALSE}
# Development version from GitHub:
# install.packages("remotes")
remotes::install_github("evandeilton/betaregscale")
```

```{r library}
library(betaregscale)
```

## Censoring types

The complete likelihood supports four censoring
types, automatically classified by `brs_check()`:

| $\delta$ | Type | Likelihood contribution |
|:--------:|:-----|:-----------------------|
| 0 | Exact (uncensored) | $f(y_i;a_i,b_i)$ |
| 1 | Left-censored ($y=0$) | $F(u_i;a_i,b_i)$ |
| 2 | Right-censored ($y=K$) | $1-F(l_i;a_i,b_i)$ |
| 3 | Interval-censored | $F(u_i;a_i,b_i)-F(l_i;a_i,b_i)$ |

where $f(\cdot)$ and $F(\cdot)$ are the beta density and CDF,
$[l_i,u_i]$ are the interval endpoints, and $(a_i,b_i)$ are the beta
shape parameters derived from $\mu_i$ and $\phi_i$ via the chosen
reparameterization.
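In base R terms, each contribution can be evaluated with `dbeta()` and
`pbeta()`. A minimal sketch follows; the helper name and the Beta(3, 2)
shapes are illustrative only, not package code:

```{r lik-contrib-sketch}
# Likelihood contribution of one observation given its censoring code
# delta and beta shape parameters (a, b); illustrative helper only.
lik_contrib <- function(delta, a, b, y = NA, l = NA, u = NA) {
  switch(as.character(delta),
    "0" = dbeta(y, a, b),                  # exact
    "1" = pbeta(u, a, b),                  # left-censored
    "2" = 1 - pbeta(l, a, b),              # right-censored
    "3" = pbeta(u, a, b) - pbeta(l, a, b)  # interval-censored
  )
}

# An interval-censored observation in [0.55, 0.65] under Beta(3, 2):
lik_contrib(3, a = 3, b = 2, l = 0.55, u = 0.65)
```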

## Interval construction



Scale observations are mapped to $(0,1)$ with midpoint uncertainty intervals:

$$y_t=y/K,\quad\text{interval }[y_t-h/K,y_t+h/K]$$

where $K$ is the number of scale categories (`ncuts`) and $h$ is the
half-width (`lim`, default **0.5**).
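The mapping is easy to reproduce in plain R. The sketch below applies the
rule directly; `brs_check()` additionally handles the $y=0$ and $y=K$
boundaries:

```{r interval-sketch}
K <- 10   # number of scale categories (ncuts)
h <- 0.5  # half-width (lim)
y <- 6
yt <- y / K
c(yt = yt, lower = yt - h / K, upper = yt + h / K)
#> yt = 0.60, lower = 0.55, upper = 0.65
```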

```{r check-response}
# Illustrate brs_check with a 0-10 NRS scale
y_example <- c(0, 3, 5, 7, 10)
cr <- brs_check(y_example, ncuts = 10)
kbl10(cr)
```

The `delta` column shows that $y=0$ is left-censored ($\delta=1$),
$y=10$ is right-censored ($\delta=2$), and all interior values are
interval-censored ($\delta=3$).

## Data preparation with `brs_prep()`

In practice, analysts may want to supply their own censoring indicators
or interval endpoints rather than relying on the automatic classification
of `brs_check()`. The `brs_prep()` function provides a flexible,
validated bridge between raw analyst data and `brs()`.

It supports four input modes:

### Mode 1: Score only (automatic)

```{r bs-prepare-mode1}
# Equivalent to brs_check(): delta inferred from y
d1 <- data.frame(y = c(0, 3, 5, 7, 10), x1 = rnorm(5))
kbl10(brs_prep(d1, ncuts = 10))
```

### Mode 2: Score + explicit censoring indicator

```{r bs-prepare-mode2}
# Analyst specifies delta directly
d2 <- data.frame(
  y     = c(50, 0, 99, 50),
  delta = c(0, 1, 2, 3),
  x1    = rnorm(4)
)
kbl10(brs_prep(d2, ncuts = 100))
```

### Mode 3: Interval endpoints with NA patterns

When the analyst provides `left` and/or `right` columns, censoring is
inferred from the NA pattern:

```{r bs-prepare-mode3}
d3 <- data.frame(
  left  = c(NA, 20, 30, NA),
  right = c(5, NA, 45, NA),
  y     = c(NA, NA, NA, 50),
  x1    = rnorm(4)
)
kbl10(brs_prep(d3, ncuts = 100))
```

### Mode 4: Analyst-supplied intervals

When the analyst provides `y`, `left`, and `right` simultaneously, their
endpoints are used directly (rescaled by $K$):

```{r bs-prepare-mode4}
d4 <- data.frame(
  y     = c(50, 75),
  left  = c(48, 73),
  right = c(52, 77),
  x1    = rnorm(2)
)
kbl10(brs_prep(d4, ncuts = 100))
```

### Using brs_prep with `brs()`

Data processed by `brs_prep()` is automatically detected by
`brs()`, so the internal `brs_check()` step is skipped:

```{r prepare-workflow}
set.seed(42)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim <- brs_sim(
  formula = ~ x1 + x2, data = dat,
  beta = c(0.2, -0.5, 0.3), phi = 0.3,
  link = "logit", link_phi = "logit",
  repar = 2
)
# Remove left, right, yt so brs_prep can rebuild them
prep <- brs_prep(sim[-c(1:3)], ncuts = 100)
fit_prep <- brs(y ~ x1 + x2,
  data = prep, repar = 2,
  link = "logit", link_phi = "logit"
)
summary(fit_prep, digits = 4)
```

## Example 1: Fixed dispersion model

### Simulating data

We simulate observations from a beta regression model with fixed
dispersion, two covariates, and a logit link for the mean.

```{r sim-fixed}
set.seed(4255)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

sim_fixed <- brs_sim(
  formula  = ~ x1 + x2,
  data     = dat,
  beta     = c(0.3, -0.6, 0.4),
  phi      = 1 / 10,
  link     = "logit",
  link_phi = "logit",
  ncuts    = 100,
  repar    = 2
)

kbl10(sim_fixed, n = 8)
```

Each observation is centered in its interval. For example, a score of 67
on a 0–100 scale yields $y_t=0.67$ with the interval $[0.665,0.675]$.

### Fitting the model

```{r fit-fixed}
fit_fixed <- brs(
  y ~ x1 + x2,
  data     = sim_fixed,
  link     = "logit",
  link_phi = "logit",
  repar    = 2
)
summary(fit_fixed)
```

The summary output follows the `betareg` package style, showing separate
coefficient tables for the mean and precision submodels, with Wald
z-tests and $p$-values based on the standard normal distribution.

### Goodness of fit

```{r gof-fixed}
kbl10(brs_gof(fit_fixed))
```

### Comparing link functions

The package supports several link functions for the mean submodel. We
can compare them using information criteria:

```{r compare-links}
links <- c("logit", "probit", "cauchit", "cloglog")
fits <- lapply(setNames(links, links), function(lnk) {
  brs(y ~ x1 + x2, data = sim_fixed, link = lnk, repar = 2)
})

# Estimates
est_table <- do.call(rbind, lapply(names(fits), function(lnk) {
  e <- brs_est(fits[[lnk]])
  e$link <- lnk
  e
}))
kbl10(est_table)

# Goodness of fit
gof_table <- do.call(rbind, lapply(fits, brs_gof))
kbl10(gof_table)
```

### Residual diagnostics

The `plot()` method provides six diagnostic panels. By default, the
first four are shown:

```{r plot-fixed, fig.height = 6}
plot(fit_fixed)
```

For ggplot2 output (requires the **ggplot2** package):

```{r plot-fixed-gg, eval = requireNamespace("ggplot2", quietly = TRUE), fig.height = 6}
plot(fit_fixed, gg = TRUE)
```

### Predictions

```{r predict-fixed}
# Fitted means
kbl10(
  data.frame(mu_hat = head(predict(fit_fixed, type = "response"))),
  digits = 4
)

# Conditional variance
kbl10(
  data.frame(var_hat = head(predict(fit_fixed, type = "variance"))),
  digits = 4
)

# Quantile predictions
kbl10(predict(fit_fixed, type = "quantile", at = c(0.10, 0.25, 0.5, 0.75, 0.90)))
```

### Confidence intervals

Wald confidence intervals based on the asymptotic normal approximation:

```{r confint-fixed}
kbl10(confint(fit_fixed))
kbl10(confint(fit_fixed, model = "mean"))
```

### Censoring structure

The `brs_cens()` function provides a visual and tabular
overview of the censoring types in the fitted model:

```{r censoring-summary, fig.height = 5}
brs_cens(fit_fixed, gg = TRUE, inform = TRUE)
```

## Example 2: Variable dispersion model

In many applications, the dispersion parameter $\phi$ may depend on
covariates. The package supports variable-dispersion models using the
`Formula` package notation: `y ~ x1 + x2 | z1 + z2`, where the terms
after `|` define the linear predictor for $\phi$. The same `brs_sim()`
function is used for fixed and variable dispersion; the second formula
part activates the precision submodel in simulation.
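Concretely, with logit links on both submodels, each observation's mean and
dispersion come from its two linear predictors via `plogis()`. A sketch with
illustrative coefficients and design rows (all values are made up):

```{r linpred-sketch}
beta <- c(0.2, -0.6, 0.2)  # mean submodel: intercept, x1, x2
zeta <- c(0.2, -0.8)       # precision submodel: intercept, z1
x <- c(1, 0.5, -1.0)       # one row of the mean design matrix
z <- c(1, 0.3)             # one row of the precision design matrix
mu  <- plogis(sum(x * beta))  # mean in (0, 1)
phi <- plogis(sum(z * zeta))  # dispersion in (0, 1) under repar = 2
c(mu = mu, phi = phi)
```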

### Simulating data

```{r sim-variable}
set.seed(2222)
n <- 1000
dat_z <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rbinom(n, size = 1, prob = 0.5),
  z1 = rnorm(n),
  z2 = rnorm(n)
)

sim_var <- brs_sim(
  formula = ~ x1 + x2 + x3 | z1 + z2,
  data = dat_z,
  beta = c(0.2, -0.6, 0.2, 0.2),
  zeta = c(0.2, -0.8, 0.6),
  link = "logit",
  link_phi = "logit",
  ncuts = 100,
  repar = 2
)

kbl10(sim_var)
```

### Fitting the model

```{r fit-variable}
fit_var <- brs(
  y ~ x1 + x2 | z1,
  data     = sim_var,
  link     = "logit",
  link_phi = "logit",
  repar    = 2
)
summary(fit_var)
```

Notice the `(phi)_` prefix in the precision coefficient names, following
the `betareg` convention.

### Accessing coefficients by submodel

```{r coef-variable}
# Full parameter vector
round(coef(fit_var), 4)

# Mean submodel only
round(coef(fit_var, model = "mean"), 4)

# Precision submodel only
round(coef(fit_var, model = "precision"), 4)

# Variance-covariance matrix for the mean submodel
kbl10(vcov(fit_var, model = "mean"))
```

### Comparing link functions (variable dispersion)

```{r compare-links-var}
links <- c("logit", "probit", "cauchit", "cloglog")
fits_var <- lapply(setNames(links, links), function(lnk) {
  brs(y ~ x1 + x2 | z1, data = sim_var, link = lnk, repar = 2)
})

# Estimates
est_var <- do.call(rbind, lapply(names(fits_var), function(lnk) {
  e <- brs_est(fits_var[[lnk]])
  e$link <- lnk
  e
}))
kbl10(est_var)

# Goodness of fit
gof_var <- do.call(rbind, lapply(fits_var, brs_gof))
kbl10(gof_var)
```

### Diagnostics for variable dispersion

```{r plot-variable, fig.height = 6}
plot(fit_var)
```

## Advanced analyst functions

The package includes analyst-facing helpers for uncertainty quantification,
effect interpretation, score-scale communication, and predictive validation.

### Parametric bootstrap confidence intervals



```{r bootstrap-fixed}
set.seed(101)
boot_ci <- brs_bootstrap(
  fit_fixed,
  R = 100,
  level = 0.95,
  ci_type = "bca",
  keep_draws = TRUE
)
kbl10(boot_ci)
```

```{r bootstrap-forest, eval = requireNamespace("ggplot2", quietly = TRUE)}
autoplot.brs_bootstrap(
  boot_ci,
  type = "ci_forest",
  title = "Bootstrap (BCa) vs Wald intervals"
)
```

### Average marginal effects (AME)



```{r marginaleffects-fixed}
set.seed(202)
ame <- brs_marginaleffects(
  fit_fixed,
  model = "mean",
  type = "response",
  interval = TRUE,
  n_sim = 120,
  keep_draws = TRUE
)
kbl10(ame)
```

```{r marginaleffects-plot, eval = requireNamespace("ggplot2", quietly = TRUE)}
autoplot.brs_marginaleffects(ame, type = "forest")
```

### Score probabilities on the original scale

```{r scoreprob-fixed}
prob_scores <- brs_predict_scoreprob(fit_fixed, scores = 0:10)
kbl10(prob_scores[1:6, 1:6])
kbl10(
  data.frame(row_sum = rowSums(prob_scores)[1:6]),
  digits = 4
)
```

### Repeated k-fold cross-validation

```{r cv-fixed}
set.seed(303) # For cross-validation reproducibility
cv_res <- brs_cv(
  y ~ x1 + x2,
  data = sim_fixed,
  k = 5,
  repeats = 5,
  repar = 2
)
kbl10(cv_res)
kbl10(
  data.frame(
    metric = c("log_score", "rmse_yt", "mae_yt"),
    mean = c(
      mean(cv_res$log_score, na.rm = TRUE),
      mean(cv_res$rmse_yt, na.rm = TRUE),
      mean(cv_res$mae_yt, na.rm = TRUE)
    )
  ),
  digits = 4
)
```

## S3 methods reference

The following standard S3 methods are available for objects of class
`"brs"`:

| Method | Description |
|:-------|:-----------|
| `print()` | Compact display of call and coefficients |
| `summary()` | Detailed output with Wald tests and goodness-of-fit |
| `coef(model=)` | Extract coefficients (full, mean, or precision) |
| `vcov(model=)` | Variance-covariance matrix (full, mean, or precision) |
| `confint(model=)` | Wald confidence intervals |
| `logLik()` | Log-likelihood value |
| `AIC()`, `BIC()` | Information criteria |
| `nobs()` | Number of observations |
| `formula()` | Model formula |
| `model.matrix(model=)` | Design matrix (mean or precision) |
| `fitted()` | Fitted mean values |
| `residuals(type=)` | Residuals: response, pearson, rqr, weighted, sweighted |
| `predict(type=)` | Predictions: response, link, precision, variance, quantile |
| `plot(gg=)` | Diagnostic plots (base R or ggplot2) |
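As a pointer for the `rqr` type: for an exact observation, the randomized
quantile residual (Dunn and Smyth, 1996) is the standard-normal quantile of
the fitted CDF evaluated at the response. A one-liner under assumed shapes
(values are illustrative, not from a fitted model):

```{r rqr-sketch}
y <- 0.62; a <- 3; b <- 2  # assumed observation and fitted shapes
qnorm(pbeta(y, a, b))      # rqr for an exact (uncensored) point
```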

## Reparameterizations

The package supports three reparameterizations of the beta distribution,
controlled by the `repar` argument:

**Direct (`repar = 0`):** Shape parameters $a=\mu$ and $b=\phi$
are used directly. This is rarely used in practice.

**Precision (`repar = 1`, Ferrari & Cribari-Neto, 2004):** The mean
$\mu\in(0,1)$ and precision $\phi>0$ yield $a=\mu\phi$ and
$b=(1-\mu)\phi$. Higher $\phi$ means less variability.

**Mean–variance (`repar = 2`):** The mean $\mu\in(0,1)$ and
dispersion $\phi\in(0,1)$ yield $a=\mu(1-\phi)/\phi$ and
$b=(1-\mu)(1-\phi)/\phi$. Here $\phi$ is a dispersion proportion, with
$\mathrm{Var}(y)=\phi\,\mu(1-\mu)$: smaller $\phi$ means less variability.

```{r reparam}
# Precision parameterization: mu = 0.5, phi = 10 (high precision)
brs_repar(mu = 0.5, phi = 10, repar = 1)

# Mean-variance parameterization: mu = 0.5, phi = 0.1 (low dispersion)
brs_repar(mu = 0.5, phi = 0.1, repar = 2)
```
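The variance formulas implied by the two main reparameterizations can be
checked with the standard beta moments in base R (no package code involved):

```{r repar-check}
# Variance of a Beta(a, b) distribution
beta_var <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))

mu <- 0.5
# repar = 1: a = mu*phi, b = (1-mu)*phi  =>  Var = mu*(1-mu)/(1+phi)
phi1 <- 10
v1 <- beta_var(mu * phi1, (1 - mu) * phi1)
all.equal(v1, mu * (1 - mu) / (1 + phi1))
#> TRUE

# repar = 2: a = mu*(1-phi)/phi, b = (1-mu)*(1-phi)/phi
#            =>  Var = phi*mu*(1-mu)
phi2 <- 0.1
v2 <- beta_var(mu * (1 - phi2) / phi2, (1 - mu) * (1 - phi2) / phi2)
all.equal(v2, phi2 * mu * (1 - mu))
#> TRUE
```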

## References

- Ferrari, S. L. P., and Cribari-Neto, F. (2004). Beta regression for
  modelling rates and proportions. *Journal of Applied Statistics*,
  **31**(7), 799–815. <https://doi.org/10.1080/0266476042000214501>

- Smithson, M., and Verkuilen, J. (2006). A better lemon squeezer?
  Maximum-likelihood regression with beta-distributed dependent variables.
  *Psychological Methods*, **11**(1), 54–71.
  <https://doi.org/10.1037/1082-989X.11.1.54>

- Hawker, G. A., Mian, S., Kendzerska, T., and French, M. (2011).
  Measures of adult pain: Visual Analog Scale for Pain (VAS Pain),
  Numeric Rating Scale for Pain (NRS Pain), McGill Pain Questionnaire (MPQ),
  Short-Form McGill Pain Questionnaire (SF-MPQ), Chronic Pain Grade Scale
  (CPGS), Short Form-36 Bodily Pain Scale (SF-36 BPS), and Measure of
  Intermittent and Constant Osteoarthritis Pain (ICOAP).
  *Arthritis Care and Research*, **63**(S11), S240–S252.
  <https://doi.org/10.1002/acr.20543>

- Hjermstad, M. J., Fayers, P. M., Haugen, D. F., et al. (2011).
  Studies comparing numerical rating scales, verbal rating scales, and
  visual analogue scales for assessment of pain intensity in adults:
  a systematic literature review.
  *Journal of Pain and Symptom Management*, **41**(6), 1073–1093.
  <https://doi.org/10.1016/j.jpainsymman.2010.08.016>
