---
title: "Univariate analysis of continuous and categorical variables"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Univariate analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

The first step in data exploration usually consists of univariate, descriptive analysis of all variables of interest. Tidycomm offers four basic functions to quickly output relevant statistics:

- `describe()` for continuous variables
- `tab_percentiles()` for continuous variables
- `describe_cat()` for categorical variables
- `tab_frequencies()` for categorical variables

```{r setup, message = FALSE, warning = FALSE, include = FALSE}
library(tidycomm)
```

For demonstration purposes, we will use sample data from the [Worlds of Journalism](https://worldsofjournalism.org/) 2012-16 study included in tidycomm. 

```{r}
WoJ
```


## Describe continuous variables

`describe()` outputs several measures of central tendency and variability for all variables named in the function call:

```{r}
WoJ %>%  
  describe(autonomy_selection, autonomy_emphasis, work_experience)
```

If no variables are passed to `describe()`, all numeric variables in the data are described:

```{r}
WoJ %>% 
  describe()
```

Data can be grouped before describing:

```{r}
WoJ %>%  
  dplyr::group_by(country) %>% 
  describe(autonomy_emphasis, autonomy_selection)
```

The returning results from `describe()` can also be visualized:

```{r}
WoJ %>% 
  describe() %>% 
  visualize()
```

In addition, percentiles can easily be extracted from continuous variables:

```{r}
WoJ %>% 
  tab_percentiles()
```

Percentiles can also be visualized:

```{r}
WoJ %>% 
  tab_percentiles(trust_parties) %>% 
  visualize()
```



## Describe categorical variables

`describe_cat()` outputs a short summary of categorical variables (number of unique values, mode, N of mode) of all variables named in the function call:

```{r}
WoJ %>% 
  describe_cat(reach, employment, temp_contract)
```

If no variables are passed to `describe_cat()`, all categorical variables (i.e., `character` and `factor` variables) in the data are described:

```{r}
WoJ %>% 
  describe_cat()
```

Data can be grouped before describing:

```{r}
WoJ %>% 
  dplyr::group_by(reach) %>% 
  describe_cat(country, employment)
```

Again, also the results from `describe_cat()` can be visualized like so:

```{r}
WoJ %>% 
  describe_cat() %>% 
  visualize()
```

## Tabulate frequencies of categorical variables

`tab_frequencies()` outputs absolute and relative frequencies of all unique values of one or more categorical variables:

```{r}
WoJ %>%  
  tab_frequencies(employment)
```

Passing more than one variable will compute relative frequencies based on all combinations of unique values:

```{r}
WoJ %>%  
  tab_frequencies(employment, country)
```

You can also group your data before. This will lead to within-group relative frequencies:

```{r}
WoJ %>% 
  dplyr::group_by(country) %>%  
  tab_frequencies(employment)
```

(Compare the columns `percent`, `cum_n` and `cum_percent` with the output above.)

And of course, also `tab_frequencies()` can easily be visualized:

```{r}
WoJ %>% 
  tab_frequencies(country) %>% 
  visualize()
```

