---
title: "EpidigiR: Digital Epidemiological Analysis and Visualization Tools"
author: "Esther Atsabina Wanjala"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{EpidigiR: Digital Epidemiological Analysis and Visualization Tools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction to EpidigiR: Epidemiological Analysis and Visualization

EpidigiR is an R package for epidemiological analysis, modeling, and visualization, designed with minimal dependencies and comprehensive functionality. It provides three main functions to cover 12 epidemiological topics, including a digital epidemiology aspect that leverages real-time data integration and advanced computational techniques to enhance disease tracking and prediction.

-   **epi_analyze**: Performs summary statistics, SIR modeling, DALY calculations, age standardization, diagnostic test evaluation, and NLP keyword extraction.
-   **epi_model**: Handles clinical trial power calculation, survival analysis, SNP association, logistic regression, k-means clustering, Random Forest, and SVM.
-   **epi_visualize**: Creates visualizations for prevalence mapping, epidemic curves, scatter plots, and boxplots.

The package includes ten datasets to support these analyses: epi_prevalence, sir_data, geno_data, ml_data, nlp_data, clinical_data, daly_data, survey_data, diagnostic_data, and survival_data.

This vignette demonstrates how to use these functions and datasets for various epidemiological tasks.

## Setup

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Required packages
required_packages <- c(
  "deSolve", "sp", "tm", "glmnet", "caret",
  "kernlab", "survival", "randomForest", "EpidigiR"
)

missing_pkgs <- required_packages[!vapply(required_packages, requireNamespace, logical(1L), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  message("Install them manually before running this vignette: install.packages(missing_pkgs)")
}

for (pkg in required_packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  }
}
# Prepare datasets
if (exists("ml_data")) {
  ml_data$outcome <- as.factor(ml_data$outcome)
}
if (exists("clinical_data")) {
  clinical_data$outcome <- as.factor(clinical_data$outcome)
}
```

## Datasets

The package includes the following datasets:

-   **epi_prevalence**: Disease prevalence by region and age group, with spatial coordinates (12 rows).
-   **sir_data**: Simulated SIR model output (50 rows).
-   **geno_data**: Genotype and case-control data for SNP analysis (100 rows).
-   **ml_data**: Patient data for machine learning (logistic regression, clustering, Random Forest, SVM; 100 rows).
-   **nlp_data**: Epidemiological text data for NLP (100 rows).
-   **clinical_data**: Clinical trial data for power calculations and outcome analysis (200 rows).
-   **daly_data**: Data for DALY calculations (20 rows).
-   **survey_data**: Data for age standardization (20 rows).
-   **diagnostic_data**: Data for diagnostic test evaluation (10 rows).
-   **survival_data**: Data for survival analysis (100 rows).

## Examples

## Summary Statistics

```{r}
data(epi_prevalence)
result <- epi_analyze(
  epi_prevalence,
  outcome = "cases",
  population = "population",
  group = "region",
  type = "summary"
)
print(result)
```
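
For intuition, a grouped summary of this kind usually reduces to cases divided by population within each group. The base R sketch below uses a small made-up data frame (`toy`, with hypothetical values) and is not necessarily how `epi_analyze` computes its output.

```r
# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  region     = c("A", "A", "B", "B"),
  cases      = c(10, 30, 5, 15),
  population = c(1000, 1000, 500, 1500)
)

# Sum cases and population within each region
by_region <- aggregate(cbind(cases, population) ~ region, data = toy, FUN = sum)

# Crude prevalence per 1,000 population
by_region$prevalence_per_1000 <- 1000 * by_region$cases / by_region$population
by_region
```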

## SIR Epidemic Model

```{r}
sir_result <- epi_analyze(
  data = NULL, outcome = NULL, type = "sir",
  N = 1000, beta = 0.3, gamma = 0.1, days = 50
)
epi_visualize(sir_result, x = "time", y = "Infected", type = "curve", main = "Epidemic Curve")
```
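
To show what `N`, `beta`, and `gamma` mean, here is a minimal base R sketch of the standard SIR equations, stepped forward with a simple Euler scheme. The helper `sir_euler`, the initial infected count `I0 = 1`, and the step size `dt` are assumptions made for illustration; the package may solve the system differently.

```r
# Illustrative SIR integrator (hypothetical helper, not part of EpidigiR)
# dS/dt = -beta * S * I / N
# dI/dt =  beta * S * I / N - gamma * I
# dR/dt =  gamma * I
sir_euler <- function(N, beta, gamma, days, I0 = 1, dt = 0.1) {
  steps <- round(days / dt)
  S <- numeric(steps + 1)
  I <- numeric(steps + 1)
  R <- numeric(steps + 1)
  S[1] <- N - I0
  I[1] <- I0
  for (k in seq_len(steps)) {
    new_inf <- beta * S[k] * I[k] / N * dt  # new infections in this step
    new_rec <- gamma * I[k] * dt            # new recoveries in this step
    S[k + 1] <- S[k] - new_inf
    I[k + 1] <- I[k] + new_inf - new_rec
    R[k + 1] <- R[k] + new_rec
  }
  data.frame(time = (0:steps) * dt, Susceptible = S, Infected = I, Recovered = R)
}

out <- sir_euler(N = 1000, beta = 0.3, gamma = 0.1, days = 50)
out[which.max(out$Infected), ]  # row at the epidemic peak
```

With these parameters the basic reproduction number is `beta / gamma = 3`, so the infected count grows at first and then declines as susceptibles are depleted.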

## Spatial Map

```{r}
data(epi_prevalence)
coordinates(epi_prevalence) <- ~lon + lat
epi_visualize(epi_prevalence, x = "prevalence", type = "map", main = "Prevalence Map")
```

## Logistic Model

```{r}
data(clinical_data)
clinical_data$outcome <- as.factor(clinical_data$outcome)
model <- epi_model(clinical_data, formula = outcome ~ age + health_score + dose, type = "logistic")
head(model$predictions)
```

## Random Forest with Clinical Data

```{r}
rf_model <- epi_model(clinical_data, formula = outcome ~ age + health_score + dose, type = "rf")
head(rf_model$predictions)
```

## Global Health Burden (DALY)

```{r}
data(daly_data)
epi_analyze(daly_data, outcome = NULL, type = "daly")
```
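
As background, DALYs are conventionally the sum of years of life lost (YLL) and years lived with disability (YLD). The sketch below applies that definition to a made-up data frame; the column names and values are hypothetical and need not match `daly_data` or the internals of `epi_analyze`.

```r
# Hypothetical inputs; the values are illustrative only
toy_burden <- data.frame(
  condition         = c("X", "Y"),
  deaths            = c(10, 2),
  years_lost        = c(20, 35),   # remaining life expectancy at age of death
  cases             = c(200, 50),
  disability_weight = c(0.1, 0.4), # 0 = full health, 1 = equivalent to death
  duration          = c(2, 5)      # average years lived with the condition
)

# YLL = deaths x remaining life expectancy
toy_burden$yll <- toy_burden$deaths * toy_burden$years_lost

# YLD = cases x disability weight x duration
toy_burden$yld <- toy_burden$cases * toy_burden$disability_weight * toy_burden$duration

# DALY = YLL + YLD
toy_burden$daly <- toy_burden$yll + toy_burden$yld
toy_burden[, c("condition", "yll", "yld", "daly")]
```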

## SNP Association

```{r}
data(geno_data)
epi_model(geno_data, formula = outcome ~ snp1 + snp2, type = "snp")
```

## Age Standardization

```{r}
data(survey_data)
epi_analyze(survey_data, outcome = NULL, type = "age_standardize")
```
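
For reference, direct age standardization weights each age-specific rate by a standard population. The sketch below uses a made-up data frame with hypothetical column names and values; it illustrates the general method and is not necessarily what `epi_analyze` does with `survey_data`.

```r
# Hypothetical inputs; the values are illustrative only
toy_rates <- data.frame(
  age_group  = c("0-39", "40-64", "65+"),
  cases      = c(10, 40, 90),
  population = c(10000, 8000, 3000),
  std_pop    = c(60000, 30000, 10000)  # assumed standard population
)

# Age-specific rates in the study population
toy_rates$rate <- toy_rates$cases / toy_rates$population

# Crude rate ignores the age structure
crude_per_100k <- 1e5 * sum(toy_rates$cases) / sum(toy_rates$population)

# Directly standardized rate: rates weighted by the standard population
adjusted_per_100k <- 1e5 * sum(toy_rates$rate * toy_rates$std_pop) / sum(toy_rates$std_pop)

c(crude = crude_per_100k, adjusted = adjusted_per_100k)
```

The adjusted rate is lower than the crude rate here because the toy study population is older than the assumed standard population.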

## Logistic Regression with Machine Learning Data

```{r}
data(ml_data)
epi_model(ml_data, formula = outcome ~ age + exposure + genetic_risk, type = "logistic")
```

## Survival Analysis

Perform survival analysis using survival_data.

```{r}
data(survival_data)
epi_model(survival_data, type = "survival")
```

## NLP Keyword Extraction

```{r}
data(nlp_data)
nlp_result <- epi_analyze(nlp_data, outcome = NULL, population = NULL, type = "nlp", n = 5)
head(nlp_result)
```

## K-means Clustering

```{r}
data(ml_data)
epi_model(ml_data[, c("age", "exposure", "genetic_risk")], type = "kmeans", k = 3)
```

## SVM Modeling

```{r}
data(ml_data)
ml_data$outcome <- as.factor(ml_data$outcome)
svm_model <- epi_model(ml_data, formula = outcome ~ age + exposure + genetic_risk, type = "svmRadial")
svm_model$performance
```

## Diagnostic Tests

```{r}
data(diagnostic_data)
epi_analyze(diagnostic_data, outcome = NULL, type = "diagnostic")
```
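
The usual diagnostic accuracy measures follow directly from a 2x2 table of test result against true disease status. The helper `diag_metrics` and the counts below are hypothetical and serve only to illustrate the definitions; the output format of `epi_analyze` may differ.

```r
# Illustrative helper (not part of EpidigiR): accuracy measures from 2x2 counts
diag_metrics <- function(tp, fp, fn, tn) {
  c(
    sensitivity = tp / (tp + fn),  # P(test positive | diseased)
    specificity = tn / (tn + fp),  # P(test negative | not diseased)
    ppv         = tp / (tp + fp),  # P(diseased | test positive)
    npv         = tn / (tn + fn)   # P(not diseased | test negative)
  )
}

# Hypothetical counts: 100 diseased and 900 non-diseased individuals
metrics <- diag_metrics(tp = 90, fp = 30, fn = 10, tn = 870)
round(metrics, 3)
```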

## Boxplot Visualization

```{r}
data(clinical_data)
epi_visualize(clinical_data, x = "arm", y = "outcome", type = "boxplot", main = "Outcome by Treatment Arm")
```

## Scatter Plot Visualization

```{r}
data(ml_data)
epi_visualize(ml_data, x = "age", y = "outcome", type = "scatter", main = "Age vs. Disease Outcome")
```

## Conclusion

EpidigiR offers a streamlined toolkit for epidemiological analysis built around three functions (epi_analyze, epi_model, and epi_visualize) and ten bundled datasets. These tools support analyses ranging from SIR modeling to machine learning methods such as Random Forest and SVM. The package also includes a digital epidemiology component that uses real-time data and advanced computational approaches to improve disease monitoring and forecasting, making it a useful resource for researchers and analysts.

## License

EpidigiR is released under the MIT License © 2025 Esther Atsabina Wanjala.