---
title: "Introduce dlookr"
author: "Choonghyun Ryu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduce dlookr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r environment, echo = FALSE, message = FALSE, warning=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "", out.width = "600px", dpi = 70)
options(tibble.print_min = 4L, tibble.print_max = 4L)

library(dlookr)
library(dplyr)
library(ggplot2)
```

## Preface
After you have acquired the data, you should do the following:

* Diagnose data quality.
    + If there is a problem with data quality,
    + The data must be corrected or re-acquired.
* Explore data to understand the data and find scenarios for performing the analysis.
* Derive new variables or perform variable transformations.

The dlookr package makes these steps fast and easy:

* Performs a data diagnosis or automatically generates a data diagnosis report.
* Discover data in various ways and automatically generate EDA(exploratory data analysis) reports.
* Impute missing values and outliers, resolve skewed data, and binaries continuous variables into categorical variables. And generates an automated report to support it.

dlookr increases synergy with `dplyr`. Particularly in data exploration and data wrangling, it increases the efficiency of the `tidyverse` package group.

## Supported data structures
Data diagnosis supports the following data structures.

* data frame: data.frame class.
* data table: tbl_df class.
* table of DBMS: table of the DBMS through tbl_dbi
  + Use dplyr as the back-end interface for any DBI-compatible database.
  
  
## List of supported tasks of data analytics

### Diagnose Data

#### Overall Diagnose Data
Tasks | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
describe overview of data | Inquire basic information to understand the data in general | `overview()` | 
summary overview object  | summary described overview of data | `summary.overview()` | 
plot overview object | plot described overview of data | `plot.overview()` | 
diagnose data quality of variables | The scope of data quality diagnosis is information on missing values and unique value information | `diagnose()` | x
diagnose data quality of categorical variables | frequency, ratio, rank by levels of each variables | `diagnose_category()` | x
diagnose data quality of numerical variables | descriptive statistics, number of zero, minus, outliers | `diagnose_numeric()` | x
diagnose data quality for outlier  | number of outliers, ratio, mean of outliers, mean with outliers, mean without outliers | `diagnose_outlier()` | x
plot outliers information of numerical data  | box plot and histogram whith outliers, without outliers | `plot_outlier.data.frame()` | x
plot outliers information of numerical data by target variable | box plot and density plot whith outliers, without outliers | `plot_outlier.target_df()` | x
diagnose combination of categorical variables | Check for sparse cases of level combinations of categorical variables | `diagnose_sparese()` | 

#### Visualize Missing Values
Tasks | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
pareto chart for missing value | visualize the Pareto chart for variables with a missing value. | `plot_na_pareto()` | 
combination chart for missing value | visualize the distribution of missing value by combining variables. | `plot_na_hclust()` | 
plot the combination variables that is include missing value | visualize the combinations of missing value across cases | `plot_na_intersect()` | 


#### Reporting
Types | Descriptions | Functions | Support DBI
:-----|:-------|:---|:---:
report the information of data diagnosis into a PDF file | report the information for diagnosing the data quality | `diagnose_report()` | x
reporting the information of data diagnosis into HTML file | report the information for diagnosing the quality of the data | `diagnose_report()` | x
reporting the information of data diagnosis into HTML file | dynamic report the information for diagnosing the quality of the data | `diagnose_web_report()` | x
reporting the information of data diagnosis into PDF and HTML files | paged report the information for diagnosing the quality of the data | `diagnose_paged_report()` | x


### EDA

#### Univariate EDA
Types | Tasks | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
categorical | summaries | frequency tables | `univar_category()` | 
categorical | summaries | chi-squared test  | `summary.univar_category()` | 
categorical | visualize | bar charts | `plot.univar_category()` | 
categorical | visualize | bar charts | `plot_bar_category()` | 
numerical | summaries | descriptive statistics | `describe()` | x
numerical | summaries | descriptive statistics | `univar_numeric()` | 
numerical | summaries | descriptive statistics of standardized variable | `summary.univar_numeric()` | 
numerical | visualize | histogram, box plot | `plot.univar_numeric()` | 
numerical | visualize | Q-Q plots | `plot_qq_numeric()` | 
numerical | visualize | box plot | `plot_box_numeric()` | 
numerical | visualize | histogram | `plot_hist_numeric()` | 

#### Bivariate EDA
Types | Tasks | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
categorical | summaries | frequency tables cross cases | `compare_category()` | 
categorical | summaries | contingency tables, chi-squared test | `summary.compare_category()` | 
categorical | visualize | mosaics plot | `plot.compare_category()` | 
numerical | summaries | correlation coefficient, linear model summaries | `compare_numeric()` | 
numerical | summaries | correlation coefficient, linear model summaries with threshold  | `summary.compare_numeric()` | 
numerical | visualize | scatter plot with marginal box plot | `plot.compare_numeric()` | 
numerical | Correlate | correlation coefficient | `correlate()` | x
numerical | Correlate | summaries with correlation matrix | `summary.correlate()` | x
numerical | Correlate | visualization of a correlation matrix | `plot.correlate()` | x
both | PPS | PPS(Predictive Power Score) | `pps()` | x
both | PPS | summaries with PPS | `summary.pps()` | x
both | PPS | visualization of a PPS matrix | `plot.pps()` | x

#### Normality Test
Types | Tasks | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
numerical | summaries | Shapiro-Wilk normality test | `normality()` | x
numerical | summaries | normality diagnosis plot (histogram, Q-Q plots) | `plot_normality()` | x

#### Relationship between target variable and predictors
Target Variable | Predictor | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
categorical | categorical | contingency tables | `relate()` | x
categorical | categorical | mosaics plot | `plot.relate()` | x
categorical | numerical | descriptive statistic for each levels and total observation | `relate()` | x
categorical | numerical | density plot | `plot.relate()` | x
categorical | categorical | bar charts | `plot_bar_category()` | 
numerical | categorical | ANOVA test | `relate()` | x
numerical | categorical | scatter plot | `plot.relate()` | x
numerical | numerical | simple linear model | `relate()` | x
numerical | numerical | box plot | `plot.relate()` | x
categorical | numerical | Q-Q plots | `plot_qq_numeric()` | 
categorical | numerical | box plot | `plot_box_numeric()` | 
categorical | numerical | histogram | `plot_hist_numeric()` | 

#### Reporting
Types | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
reporting the information of EDA into PDF file | reporting the information of EDA | `eda_report()` | x
reporting the information of EDA into HTML file | reporting the information of EDA | `eda_report()` | x
reporting the information of EDA into PDF file | dynamic reporting the information of EDA | `eda_web_report()` | x
reporting the information of EDA into HTML file | paged reporting the information of EDA | `eda_paged_report()` | x


### Transform Data

#### Find Variables
Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
missing values  | find the variable that contains the missing value in the object that inherits the data.frame | `find_na()` | 
outliers | find the numerical variable that contains outliers in the object that inherits the data.frame | `find_outliers()` | 
skewed variable | find the numerical variable that is the skewed variable that inherits the data.frame | `find_skewness()` | 

#### Imputation
Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
missing values  | missing values are imputed with some representative values and statistical methods. | `imputate_na()` | 
outliers | outliers are imputed with some representative values and statistical methods. | `imputate_outlier()` | 
summaries | calculate descriptive statistics of the original and imputed values. | `summary.imputation()` | 
visualize | the imputation of a numerical variable is a density plot, and the imputation of a categorical variable is a bar plot. | `plot.imputation()` | 

#### Binning
Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
binning | converts a numeric variable to a categorization variable | `binning()` | 
summaries | calculate frequency and relative frequency for each levels(bins) | `summary.bins()` | 
visualize | visualize two plots on a single screen. The plot at the top is a histogram representing the frequency of the level. The plot at the bottom is a bar chart representing the frequency of the level. | `plot.bins()` | 
optimal binning | categorizes a numeric characteristic into bins for ulterior usage in scoring modeling | `binning_by()` | 
summaries | summary metrics to evaluate the performance of binomial classification model | `summary.optimal_bins()` | 
visualize | generates plots for understand distribution, bad rate, and weight of evidence after running binning_by() | `plot.optimal_bins()` | 
infogain binning | categorizes a numeric characteristic into bins for multi-class variables using recursive information gain ratio maximization | `binning_rgr()` | 
visualize | generates plots for understanding distribution and  distribution by target variable after running binning_rgr() | `plot.infogain_bins()` | 
evaluate | calculates metrics to evaluate the performance of binned variable for binomial classification model | `performance_bin()` | 
summaries | summary metrics to evaluate the performance of binomial classification model after performance_bin() | `summary.performance_bin()` | 
visualize | It generates plots to understand frequency, WoE by bins using performance_bin after running binning_by() | `plot.performance_bin()` | 
visualize | extract bins from "bins" and "optimal_bins" objects | `extract.bins()` | 

#### Diagnose Binned Variable
Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
diagnosis | performs diagnose performance that calculates metrics to evaluate the performance of binned variable for binomial classification model | `performance_bin()` | 
summaries | summary method for "performance_bin". summary metrics to evaluate the performance of the binomial classification model | `
summary.performance_bin()` | 
visualize | visualize for understanding frequency, WoE by bins using performance_bin and something else | `plot.performance_bin()` | 

#### Transformation
Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
transformation | performs variable transformation for standardization and resolving skewness of numerical variables | `transform()` | 
summaries | compares the distribution of data before and after data transformation | `summary.transform()` | 
visualize | visualize two kinds of a plot by attribute of the 'transform' class. The transformation of a numerical variable is a density plot | `plot.transform()` | 

#### Reporting
Types | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
reporting the information of transformation into PDF | reporting the information of transformation | `transformation_report()` | 
reporting the information of transformation into HTML | reporting the information of transformation | `transformation_report()` | 
reporting the transformation information into PDF | dynamic reporting the transformation information | `transformation_web_report()` | 
reporting the information of transformation into HTML | paged reporting the information of transformation | `transformation_paged_report()` | 


### Miscellaneous

#### Statistics
Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
statistics  | calculate the entropy | `entropy()` | 
statistics  | calculate the skewness of the data | `skewness()` | 
statistics  | calculate the kurtosis of the data | `kurtosis()` | 
statistics  | calculate the Jensen-Shannon divergence between two probability distributions | `jsd()` | 
statistics  | calculate the Kullback-Leibler divergence between two probability distributions | `kld()` | 
statistics  | calculate the Cramer's V statistic between two categorical(discrete) variables | `cramer()` | 
statistics  | calculate the Theil's U statistic between two categorical(discrete) variables | `theil()` | 
statistics  | finding percentile of a numerical variable. | `get_percentile()` | 
statistics  | transform a numeric vector using several methods like "log", "sqrt", "log+1", "log+a", "1/x", "x^2", "x^3", "Box-Cox", "Yeo-Johnson"| `get_transform()` | 
statistics  | calculate the Cramer's V statistic | `cramer()` | 
statistics  | calculate the Theil's U statistic | `theil()` | 

#### Programming
Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
programming  | extracts variable information having a certain class from an object inheriting data.frame | `find_class()` | 
programming  | gets class of variables in data.frame or tbl_df | `get_class()` | 
programming  | retrieves the column information of the DBMS table through the tbl_bdi object of dplyr | `get_column_info()` | 
programming  | finding the user machine's OS. | `get_os()` | 
programming  | import Google fonts | `import_google_font()` | 
