---
title: "Exploring data with sumvar"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Exploring data with sumvar}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(sumvar)
library(ggplot2)
library(dplyr)

```

# Introduction

The **sumvar** package provides simple, easy-to-use tools for summarising continuous and categorical data, inspired by Stata's "sum" and "tab" commands. All functions are tidyverse/dplyr pipe-friendly and return tibbles.

# Why use sumvar?

* Simple one-line commands for exploring variables
* Pipe-friendly tidyverse integration
* Tabular summaries that can be stored as tibbles and used for downstream analysis


When I first moved from Stata to R about 5 years ago, the main thing I missed was the simplicity of the "sum" and "tab" commands for exploring data efficiently. The template code in introductory R books and tutorials (e.g. https://r4ds.hadley.nz/data-tidy.html) typically takes 3-5 lines to replicate these commands, and I couldn't find a package that explored data quite as simply and efficiently.

Sumvar is fast and easy to use, and brings these variable summary functions to R.

# Continuous Data

We call **dist_sum()** to explore a continuous variable.

The tibble output shows the number of rows in the data, the number missing, the median, the interquartile range (25th and 75th centiles), the mean, the standard deviation, 95% confidence intervals using the Wald method (normal approximation), and the minimum and maximum values.
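To make these columns concrete, the sketch below computes the same kinds of statistics by hand in base R. It assumes the confidence interval is for the mean and has the form mean ± z × SD/√n, and that missing values are dropped before summarising. The object names (`x`, `ci`, and so on) are made up for this illustration, and dist_sum() may differ in details such as the quantile definition.

```{r wald-sketch}
# Toy data: 50 observed values plus two missing values (illustrative only)
set.seed(1)
x <- c(rnorm(50, mean = 50, sd = 20), NA, NA)

n_obs  <- sum(!is.na(x))   # number of non-missing values
n_miss <- sum(is.na(x))    # number of missing values

x_mean <- mean(x, na.rm = TRUE)
x_sd   <- sd(x, na.rm = TRUE)

# Wald (normal approximation) 95% interval for the mean: mean +/- z * SE
se <- x_sd / sqrt(n_obs)
z  <- qnorm(0.975)
ci <- c(lower = x_mean - z * se, upper = x_mean + z * se)

# Median and interquartile range, then minimum and maximum
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
x_range   <- range(x, na.rm = TRUE)

c(n = n_obs, missing = n_miss, mean = x_mean, sd = x_sd, ci)
quartiles
x_range
```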

**dist_sum()** will show a density plot and histogram for a single variable, or a grouped density plot when there is a grouping variable.

You can save the output from dist_sum() as a tibble and use the estimates for downstream analysis, e.g. `sum_df <- df %>% dist_sum(age, sex)`.

```{r continuous}
# Example data
set.seed(123)
df <- tibble::tibble(
  age = rnorm(100, mean = 50, sd = 20),
  sex = sample(c("male", "female"), 100, replace = TRUE)) %>%
  dplyr::mutate(age = dplyr::if_else(sex == "male", age + 10, age))

# Call dist_sum
df %>% dist_sum(age)
df %>% dist_sum(age, sex)
```

# Dates

To explore the distribution of dates, call **dist_date()**, which works like dist_sum() and can also be grouped by a second variable. With a single date variable, a histogram is shown; when a grouping variable is also supplied, a density plot is shown.

```{r dates}
df3 <- tibble::tibble(
  dates = as.Date("2022-01-01") + rnorm(n=100, sd=50, mean=0),
  group = sample(c("A", "B"), 100, TRUE)) %>%
  dplyr::mutate(dates = dplyr::case_when(group == "A" ~ dates + 10, TRUE ~ dates))  # shift group A by 10 days

df3 %>% dist_date(dates)
df3 %>% dist_date(dates, group)
```
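For comparison, here is a base R sketch of a text-only date summary: the range, the median, and counts per calendar month (a rough stand-in for a histogram). It is not how dist_date() is implemented, and the `dates` vector and `month_counts` name are invented for this example.

```{r dates-base-sketch}
# Toy dates scattered around 1 January 2022, rounded to whole days
set.seed(2022)
dates <- as.Date("2022-01-01") + round(rnorm(100, mean = 0, sd = 50))

range(dates)    # earliest and latest date
median(dates)   # middle of the distribution

# Count observations per calendar month
month_counts <- table(format(dates, "%Y-%m"))
month_counts
```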


# Categorical Data

**tab1()** produces a tibble showing the distribution of a categorical variable and illustrates it with a horizontal bar chart.

```{r categorical}
df2 <- tibble::tibble(
  group = sample(LETTERS[1:3], 200, TRUE)
)

df2 %>% tab1(group)
```
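For comparison, a plain dplyr version of a one-way frequency table might look like the sketch below. The column names `n` and `percent` are choices made for this example and are not necessarily the names tab1() returns.

```{r tab1-dplyr-sketch}
library(dplyr)

# Toy categorical variable (illustrative only)
set.seed(42)
toy <- dplyr::tibble(group = sample(LETTERS[1:3], 200, replace = TRUE))

# Count each level, then express the counts as percentages of the total
freq <- toy %>%
  dplyr::count(group, name = "n") %>%
  dplyr::mutate(percent = round(100 * n / sum(n), 1))

freq
```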


# Two-way tables

**tab()** creates a cross-tabulation of two categorical variables. By default it shows counts and row percentages, with row and column totals.

```{r crosstab}
df_tab <- dplyr::tibble(
  treatment = sample(c("control", "treatment"), 100, replace = TRUE),
  outcome   = sample(c("improved", "stable", "worse"), 100, replace = TRUE)
)

df_tab %>% tab(treatment, outcome)
df_tab %>% tab(treatment, outcome, show = "col")  # column percentages
df_tab %>% tab(treatment, outcome, test = "chi")  # with chi-squared test
result <- df_tab %>% tab(treatment, outcome)      # save as tibble
```
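The same ingredients (counts, row or column percentages, totals, and a chi-squared test) can be assembled by hand in base R, as sketched below. This is a base R analogue for orientation only, not a description of how tab() works internally. The `toy` data frame is made up for the example.

```{r crosstab-base-sketch}
# Toy data (illustrative only)
set.seed(7)
toy <- data.frame(
  treatment = sample(c("control", "treatment"), 100, replace = TRUE),
  outcome   = sample(c("improved", "stable", "worse"), 100, replace = TRUE)
)

counts <- table(toy$treatment, toy$outcome)

addmargins(counts)                              # counts with row and column totals
round(100 * prop.table(counts, margin = 1), 1)  # row percentages
round(100 * prop.table(counts, margin = 2), 1)  # column percentages

# Pearson chi-squared test of independence
chi <- chisq.test(counts)
chi$p.value
```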


# Check for duplicate and missing data

To explore the proportion of duplicate values and missing values in a variable, pass it to **dup()**. 


```{r duplicate}
example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits=0))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.

example_data %>% dup(age)
```
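As a rough guide to what such a summary involves, the base R sketch below counts missing and duplicated values in a single vector. It assumes that a "duplicate" is any non-missing value that repeats an earlier one; dup() may define or report duplicates differently. All object names here are invented for the example.

```{r dup-base-sketch}
# Toy variable with 15 values set to missing (illustrative only)
set.seed(99)
age <- round(rnorm(200, mean = 30, sd = 10))
age[sample(200, 15)] <- NA

n_total   <- length(age)
n_missing <- sum(is.na(age))

# Among observed values, count repeats beyond the first occurrence
observed <- age[!is.na(age)]
n_dup    <- sum(duplicated(observed))
n_unique <- length(unique(observed))

c(missing_pct   = 100 * n_missing / n_total,
  duplicate_pct = 100 * n_dup / n_total)
```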


If you pass the whole data frame to **dup()**, it will produce a summary of duplicates and missingness for every variable. **dup()** illustrates the result with a stacked bar chart.

```{r duplicate_all}
example_data <- dplyr::tibble(age = round(rnorm(200, mean = 30, sd = 50), digits=0),
                              sex = sample(c("Male", "Female"), 200, TRUE),
                              favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA  # Replace 32 values with missing.

dup(example_data)
```

# Automated reports with explorer()

`explorer()` generates a single HTML or PDF report summarising all variables in a data frame (continuous, date, categorical, duplicates, and missing data) in one step.

```{r explorer, eval=FALSE}
explorer(example_data)                      # HTML report (default)
explorer(example_data, format = "pdf")      # PDF report
explorer(example_data, id_var = "id")       # exclude an identifier column
```

When run interactively, `explorer()` will prompt you to confirm whether any columns named `id` or `pid` should be excluded from summaries. The report is written to the working directory and opened automatically in the browser.
