---
title: "Introduction to DEmixR"
author: "Farrokh Habibzadeh"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette: default
  pdf_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to DEmixR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

# Introduction

The **DEmixR** package provides tools for fitting and evaluating **two-component mixture models** (normal and lognormal) using robust global optimization via Differential Evolution (`DEoptim`), with local refinement by quasi-Newton optimization.

It is designed to be more **robust** than standard EM-based methods, particularly in challenging situations where the mixture components overlap strongly or have complex likelihood surfaces.

# Installation

You can install the released version of `DEmixR` from CRAN with:

```{r, eval = FALSE}
install.packages("DEmixR")
```

# Overview of Functions

`prelim_plots()` – Diagnostic plots (histogram, QQ, PP, log-QQ)
`select_best_mixture()` – Compare lognormal vs normal mixtures using BIC
`fit_norm2()` – Fit a two-component **normal mixture**
`fit_lognorm2()` – Fit a two-component **lognormal mixture**
`bootstrap_mix2()` – Bootstrap mixture parameters (parametric or nonparametric)
`evaluate_init()` – Refine initial parameter values with local optimization

# When to Use Which Function

Use `prelim_plots()` for initial data exploration and distribution assessment
Use `select_best_mixture()` when unsure whether **normal** or **lognormal** distribution is appropriate
Use `fit_norm2()`/`fit_lognorm2()` when you know the appropriate distribution family
Use `bootstrap_mix2()` for uncertainty quantification and confidence intervals
Use `evaluate_init()` for testing specific starting values or refining solutions

# Two-Component Normal Mixture

We first generate synthetic data from a two-component **normal** mixture and draw the basic plots with `DEmixR`.

# Diagnostic Plots

```{r}
# Load the package
library(DEmixR)

# Simulation data
set.seed(123)
x <- c(rnorm(3000, mean = 0, sd = 1),
       rnorm(2000, mean = 3, sd = 1.2))

# Diagnostic plots
prelim_plots(x, which = c("hist", "qq"))
```

# Model Selection

```{r}
# Automatically choose the best mixture family
best_model <- select_best_mixture(x, n_runs = 3, NP = 50, itermax = 2000)
best_model$best$family
best_model$BICs
```
 
# Normal Mixture

```{r}
# Fit a two-component normal mixture
fit <- fit_norm2(x, n_runs = 5, quiet = 2, parallelType = 0)

fit$par      # estimated parameters
fit$logLik   # log-likelihood
fit$AIC      # Akaike Information Criterion
fit$BIC      # Bayesian Information Criterion
```

# Lognormal Mixture

```{r}
# Simulation data
set.seed(123)
y <- c(rlnorm(3000, meanlog = 0, sdlog = 0.5),
       rlnorm(2000, meanlog = 1, sdlog = 0.7))

# Fit a two-component lognormal mixture
fit_ln <- fit_lognorm2(y, n_runs = 5, parallelType = 0)
fit_ln$par
```

# Bootstrap

```{r}
# Bootstrap confidence intervals
boot_res <- bootstrap_mix2(fit_ln, B = 100, parametric = TRUE, parallelType = 0)
boot_res$central
boot_res$ci
```

# Evaluate Initialization Values

```{r}
evaluate_init(par_init = c(0.5, 0, 0.6, 1.1, 0.8), y, family = "lognormal")
```

# Practical Example: Biomarker Data Analysis

```{r}
# Simulated biomarker data with two populations
set.seed(456)
biomarker <- c(rlnorm(4000, meanlog = 2.5, sdlog = 0.4),  # Healthy group
               rlnorm(1000, meanlog = 3.8, sdlog = 0.6))  # Disease group

# Initial exploration
prelim_plots(biomarker, which = c("hist", "logqq"))

# Fit lognormal mixture
fit_bio <- try(fit_lognorm2(biomarker, n_runs = 5, parallelType = 0, quiet = 2))

if (!inherits(fit_bio, "try-error")) {
  cat("Biomarker analysis results:\n")
  print(fit_bio$par)
  
  # Estimate proportion in each group
  cat("\nEstimated proportion in group 1 (healthy):", 
      round(fit_bio$par[1], 3), "\n")
  cat("Estimated proportion in group 2 (disease):", 
      round(1 - fit_bio$par[1], 3), "\n")
  
  # Bootstrap for confidence intervals
  boot_bio <- try(bootstrap_mix2(fit_bio, B = 100, quiet = 2))
  if (!inherits(boot_bio, "try-error")) {
    cat("\n95% CI for proportion in group 1:\n")
    print(boot_bio$ci[, "p"])
  }
}
```

# Advanced Usage

This section illustrates some of the optional parameters for fine-tuning performance and diagnostics.

## Fitting with `fit_*`

```r
# Fit lognormal mixture with custom optimization settings
fit_ln <- fit_lognorm2(
  x,
  NP = 150,           # population size for DEoptim
  n_runs = 20,        # independent runs to avoid local optima
  itermax = 200,      # maximum iterations per run
  parallelType = 1,   # use parallelization across platforms
  quiet = 1,          # print minimal progress info
  par_init = NULL     # optionally supply initial values
)
```

**Key options:**
- `NP`: larger = better exploration, but slower.
- `n_runs`: increase for more robustness; with 1000, it is very slow, but have ultra robustness
- `parallelType`: `0 = serial`, `1 = socket clusters (Windows/Mac/Linux)`, `2 = forking (Mac/Linux only)`.
- `quiet`: `0 = verbose`, `1 = minimal`, `2 = silent`.

## Bootstrapping with `bootstrap_mix2`

```r
boot_res <- bootstrap_mix2(
  fit_ln,
  B = 500,            # number of bootstrap replications
  parallelType = 1,   # parallel computation
  parametric = TRUE,  # parametric bootstrap
  ci_level = 0.90,    # confidence interval level
  quiet = 1           # show progress bar
)

boot_res$central   # means for normal and medians for lognormal
boot_res$ci        # bootstrap confidence intervals
```

**Notes:**
- Supports both parametric and nonparametric bootstrap.
- Uses an internal function to avoid label switching.

## Preliminary Plots with `prelim_plots`

```r
prelim_plots(
  x,
  which = c("hist", "qq", "pp", "logqq"),
  hist_bins = 40,
  col_hist = "grey85",
  col_density = "darkorange",
  col_qq = "grey60",
  col_line = "darkorange"
)
```

**Options:**
- `which`: choose one or more of `"hist", "qq", "pp", "logqq"`.
- `hist_bins`: number of bins for histogram.
- `col_*`: customize plot colors.

## Model Selection with `select_best_mixture`

```r
sel <- select_best_mixture(
  x,
  n_runs = 5,
  NP = 100,
  itermax = 150,
  quiet = 2
)

sel$best$family
sel$best$par
```

**What it does:**
- Fits both Normal and Lognormal mixtures.
- Compares them via BIC.
- Returns the best-performing family.

## Evaluating Initial Values with `evaluate_init`

```r
init_eval <- evaluate_init(
  x,
  family = "normal",
  n_starts = 10,      # number of random initializations
  NP = 50,            # DEoptim population size
  itermax = 100,      # maximum iterations
  quiet = 2           # suppress output
)

print(init_eval)
```

**Usage:**
- Helps check the stability of optimization.
- Useful to diagnose poor convergence.

# Summary

The **DEmixR** package provides robust and flexible tools for mixture model analysis, 
making it a strong alternative to existing EM-based approaches. 
Its use of **global optimization** helps avoid local minima and improve reliability, 
especially in complex or overlapping mixtures.

**Note:**
Harmless socket connection warnings may appear when using parallelization.

# Key Features

1. **Robust estimation:** Differential Evolution optimization avoids local minima
2. **Comprehensive diagnostics:** Visualization, model selection, and bootstrap uncertainty
3. **Flexible:** Supports both normal and lognormal distributions
4. **User-friendly:** Sensible defaults with options for customization
5. **Efficient:** Optional parallel computation for faster results

# Best Practices

1. Always start with `prelim_plots()` to understand your data distribution
2. Use `select_best_mixture()` when uncertain about distribution family
3. Increase `n_runs` for challenging optimization problems
4. Use bootstrap methods for uncertainty quantification
5. Test different initial values with `evaluate_init()` if convergence is problematic
For more information, see the help files for each function: `?fit_norm2`, `?bootstrap_mix2`, etc.

# References

1. Mullen, K.M., et al. (2011). DEoptim: An R Package for Global Optimization by Differential Evolution. Journal of Statistical Software, 40(6), 1-26.
2. R Core Team (2024). R: A Language and Environment for Statistical Computing.
