---
title: "Data-driven data validation"
subtitle: "Introducing ‘validatesuggest’"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{Data-driven data validation: ‘validatesuggest’}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Intro

Data validation is a cornerstone in data intense industries, such as the
art of making official statistics. To create and maintain high quality
statistical output, data needs to be checked before being used in
statistical processes. A number of projects have been executed to
streamline and optimize data validation processes, both within
organisations as well as among organisations. An important success
factor for effective data validation is the design and maintenance of
validation rules that cover the dynamics of the data to be checked. This
has led to the definition of standardized validation rules that cover
the most common use-cases in official statistics. Examples are the
definition of internationally agreed 'main types of validation rules' by
Eurostat , and the set of recipes and standard functions as offered in
the well-known R-package validate, which are documented in the online
cookbook [3]. In this presentation we take another approach to rule
maintenance. In addition to the knowledge of the domain specialist we
let the data speak. Properties of the data, such as type, range,
distribution, correlation can be used to derive rules that catch the
essentials of the data. Since the number of rules that could potentially
be derived from data in general could be endless, we use the existing
international and national standardized validation rule systems to know
what type of rules make sense. A refinement of the concept is to also
take the time dimension of time series data into consideration. That way
time-dependent validation rules com into reach. The suggested rules are
expressed in a human-readable form, so that the domain specialist / rule
maintainer can inspect and understand them, as the data-driven concept
is intended as a suggestion to the rule maintainers. They should always
be checked and interpreted before putting into production. The type of
rules currently implemented in the experimental R-package
'validatesuggest' are:

validate, @validate[]

## Usage

```{r setup}
library(validatesuggest)
```
