---
title: "Hutils"
author: "Hugh Parsonage"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{hutils}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# `hutils` package
My name is Hugh. 
I've written some miscellaneous functions that don't seem to belong in a particular package.
I've usually put these in `R/utils.R` when I write a package.
Thus, `hutils`.

This vignette just goes through each exported function.

```{r knitrOpts}
library(knitr)
suggested_packages <- c("geosphere", "nycflights13", "dplyr", "ggplot2", "microbenchmark")
opts_chunk$set(eval = all(vapply(suggested_packages, requireNamespace, quietly = TRUE, FUN.VALUE = FALSE)))
```

```{r loadPackages}
tryCatch({
  library(geosphere)
  library(nycflights13)
  library(dplyr, warn.conflicts = FALSE)
  library(ggplot2)
  library(microbenchmark)
  library(data.table, warn.conflicts = FALSE)
  library(magrittr)
  library(hutils, warn.conflicts = FALSE)
}, 
# requireNamespace does not detect errors like
# package ‘dplyr’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version
error = function(e) {
  opts_chunk$set(eval = FALSE)
})
options(digits = 4)
```

## Aliases

These are simple additions to `magrittr`'s aliases, including: 
capitalized forms of `and` and `or` that invoke `&&` and `||` (the 'long-form' logical operators)
and `nor` / `neither` functions.

The main motivation is to make the source code easier to indent.
I occasionally find such source code easier to use.

```{r aliases}
OR(OR(TRUE,
      stop("Never happens")),  ## short-circuits
   AND(FALSE,
       stop("Never happens")))  
```

`nor` (or `neither` which is identical) returns `TRUE` if and only if both arguments are `FALSE`.

## `coalesce` and `if_else`
These are near drop-in replacements for the equivalent functions from `dplyr`. 
They are included here because they are very useful outside of the tidyverse,
but may be required in circumstances where importing `dplyr` (with all of its dependencies) 
would be inappropriate.

They attempt to be drop-in replacements but:

  1. `hutils::if_else` only works with `logical`, `integer`, `double`, and `character` type vectors.
     Lists and factors won't work.
  2. `hutils::coalesce` short-circuits on its first argument; if there are no `NA`s in `x` then `x` is returned, even
     if the other vectors are the wrong length or type.

In addition, `hutils::if_else` is generally faster than `dplyr::if_else`:

```{r compare_if_else}
my_check <- function(values) {
  all(vapply(values[-1], function(x) identical(values[[1]], x), logical(1)))
}

set.seed(2)
cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print

cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 1, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 1, replace = TRUE)

microbenchmark(dplyr =  dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print
```

This speed advantage also appears to be true of `coalesce`:

```{r compare_coalesce}
x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters, NA), size = 100e3, replace = TRUE)
B <- sample(c(letters, NA), size = 100e3, replace = TRUE)
C <- sample(c(letters, NA), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
```

especially during short-circuits:

```{r compare_coalesce_short_circuit_x}
x <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
```

```{r compare_coalesce_short_circuit_A}
x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
```


## Drop columns
To drop a column from a `data.table`, you set it to `NULL`

```{r canonical_drop_DT}
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT[, A := NULL]
```

There's nothing wrong with this, but I've found the following a useful alias, especially in a `magrittr` pipe.

```{r drop_col_hutils}
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_col("A") %>%
  drop_col("B")

# or
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_cols(c("A", "B"))
```

These functions simple invoke the canonical form, so won't be any faster.

Additionally, one can drop columns by a regular expression using `drop_colr`:

```{r drop_colr}
flights <- as.data.table(flights)

flights %>%
  drop_colr("time") %>%
  drop_colr("arr(?!_delay)", perl = TRUE)
```

## `drop_constant_cols`
When a table is filtered, the filtrate is often redundant. 

```{r drop_constant_cols}
flights %>%
  .[origin == "JFK"] %>%
  drop_constant_cols
```

## `drop_empty_cols`
This function drops columns in which all the values are `NA`.
 
```{r drop_empty_cols}
planes %>% 
  as.data.table %>% 
  .[!complete.cases(.)]

planes %>% 
  as.data.table %>% 
  .[!complete.cases(.)] %>% 
  # drops speed
  drop_empty_cols
```

## `duplicated_rows`
There are many useful functions for detecting duplicates in R.
However, in interactive use, I often want to not merely see which values are duplicated,
but also compare them to the original.
This is especially true when I am comparing duplicates across a *subset* of columns in a a `data.table`.

```{r duplicated_rows}
flights %>%
  # only the 'second' of the duplicates is returned
  .[duplicated(., by = c("origin", "dest"))]  

flights %>%
  # Both rows are returned and (by default)
  # duplicates are presented adjacently
  duplicated_rows(by = c("origin", "dest"))
```

## Haversine distance
To emphasize the miscellany of this package, I now present `haversine_distance`
which simply returns the distance between two points on the Earth,
given their latitude and longitude.

I prefer this to other packages' implementations.
Although the `geosphere` package can do a lot more than calculate distances between points, I find the interface for `distHaversine` unfortunate as it cannot be easily used inside a `data.frame`. In addition, I've found the arguments clearer in `hutils::haversine_distance` rather than trying to remember whether to use `byrow` inside the `matrix` function while passing to `distHaversine`.

```{r haversine_distance}
DT1 <- data.table(lat_orig = runif(1e5, -80, 80),
                  lon_orig = runif(1e5, -179, 179),
                  lat_dest = runif(1e5, -80, 80),
                  lon_dest = runif(1e5, -179, 179))

DT2 <- copy(DT1)

microbenchmark(DT1[, distance := haversine_distance(lat_orig, lon_orig,
                                                    lat_dest, lon_dest)],
               
               DT2[, distance := distHaversine(cbind(lon_orig, lat_orig),
                                               cbind(lon_orig, lat_orig))])
rm(DT1, DT2)
```


## `mutate_other`

There may be occasions where a categorical variable in a `data.table` may need 
to modified to reduce the number of distinct categories. For example, you may 
want to plot a chart with a set number of facets, or ensure the smooth operation of
`randomForest`, which accepts no more than
32 levels in a feature.

`mutate_other` keeps the *n* most common categories and changes the 
other categories to `Other`.

```{r mutate-other, results='asis'}
set.seed(1)
DT <- data.table(Fruit = sample(c("apple", "pear", "orange", "tomato", "eggplant"),
                                size = 20,
                                prob = c(0.45, 0.25, 0.15, 0.1, 0.05),
                                replace = TRUE),
                 Price = rpois(20, 10))

kable(mutate_other(DT, "Fruit", n = 3)[])
```

## `ngrep`

This is a 'dumb' negation of `grep`.
In recent versions of R, the option `invert = FALSE` exists.
A slight advantage of `ngrep` is that it's shorter to type. 
But if you don't have arthritis, best use `invert = FALSE` or `!grepl`.


## `notin` `ein` `enotin` `pin`

These functions provide complementary functionality to `%in%`:

### `%notin%`
`%notin%` is the negation of `%in%`, but also uses the package `fastmatch` to increase the speed of the operation

### `%ein%` and `%enotin%`
The functions `%ein%` and `%enotin%` are motivated by a different sort of problem.
Consider the following statement:

```{r iris-veriscolor}
iris <- as.data.table(iris)
iris[Species %in% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
```

On the face of it, this appears to give the average ratio of *Iris setosa* and *Iris versicolour* irises. However, it only gives the average ratio of *setosa* irises, as the correct spelling is *Iris versicolor* not *-our*. This particular error is easy to make, (in fact when I wrote this vignette, the first hit of Google for  `iris dataset` made the same spelling error), but it's easy to imagine similar mistakes, such as mistaking the capitalization of a value.
The functions `%ein%` and `%enotin%` strive to reduce the occurrence of this mistake.
The functions operate exactly the same as `%in%` and `%enotin%` but error if any of the table of values to be matched against is not present in any of the values:

```{r iris-versicolor, error=TRUE}
iris <- as.data.table(iris)
iris[Species %ein% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
```

The `e` stands for 'exists'; *i.e.* they should be read as "exists and in" and "exists and not in".

### `%pin%` 

This performs a partial match (*i.e* `grepl`) but with a possibly more readable or intuitive syntax

```{r pin}
identical(iris[grep("v", Species)],
          iris[Species %pin% "v"])
```

If the RHS has more than one element, the matching is done on alternation (*i.e.* OR):

```{r pin-multi}
iris[Species %pin% c("ver", "vir")] %>%
  head
```

There is an important qualification: if the RHS is `NULL`, then the result will be `TRUE` along the length of `x`, contrary to the behaviour of `%in%`. 
This is not entirely unexpected as `NULL` could legitimately be interpreted as \(\varepsilon\), the empty regular expression, which occurs in every string. 

### `provide.dir`

This is the same as `dir.create` but checks whether the target exists or not and does nothing if it does. Motivated by `\providecommand` in \(\rm\LaTeX{}\), which creates a macro only if it does not exist already.


### `select_which` 

This provides a similar role to `dplyr::select_if` but was originally part of `package:grattan` so has a different name. It simply returns the columns whose *values* return `TRUE` when `Which` is applied.
Additional columns (which may or not may satisfy `Which`) may be included by using `.and.dots`.
(To remove columns, you can use `drop_col`).

```{r}
DT <- data.table(x = 1:5,
                 y = letters[1:5],
                 AB = c(NA, TRUE, FALSE, TRUE, FALSE))
select_which(DT, anyNA, .and.dots = "y")
```


### `set_cols_first`

Up to and including `data.table 1.10.4`, one could only reorder the columns by supplying all the columns. 
You can use `set_cols_first` and `set_cols_last` to put columns first or last without supplying all the columns.

### Unique keys

In some circumstances, you need to know that the `key` of a `data.table` is unique. 
For example, you may expect a join to be performed later, without specifying `mult='first'` or permitting Cartesian joins.
`data.table` does not require a `key` to be unique and does not supply tools to check the uniqueness of keys. `hutils` supplies two simple functions: `has_unique_key` which when applied to a `data.table` returns `TRUE` if and only if the table has a key and it is unique.

`set_unique_key` does the same as `setkey` but will error if the resultant key is not unique.

## `hutils` v1.1.0

### `auc` 

The area under the [(ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) gives a single value to measure the tradeoff between true positives and false positives. 

```{r}
dt <- data.table(y = !sample(0:1, size = 100, replace = TRUE), 
                 x = runif(100))
dt[, pred := predict(lm(y ~ x, data = .SD), newdata = .SD)]

dt[, auc(y, pred)]
```

### `select_grep`

To select columns matching a regular expression:

```{r select_grep}
flights %>%
  select_grep("arr")
```

You can use the additional arguments `.and` and `.but.not` to override the patterns.

```{r select_grep-and}
flights %>%
  select_grep("arr", .and = "year", .but.not = "arr_time")
```

## `hutils` v1.2.0

### `RQ`
This is simply a shorthand to test whether a package needs installing. The package name need not be quoted, for convenience.

```{r}
RQ(dplyr, "dplyr must be installed")
RQ("dplyr", "dplyr needs installing", "dplyr installed.")
```

### `ahull`

This locates the biggest rectangle beneath a curve:

```{r ahull-1}
if (!identical(Sys.info()[["sysname"]], "Darwin"))
  ggplot(data.table(x = c(0, 1, 2, 3, 4), y = c(0, 1, 2, 0.1, 0))) +
  geom_area(aes(x, y)) +
  geom_rect(data = ahull(, c(0, 1, 2, 3, 4), c(0, 1, 2, 0.1, 0)),
            aes(xmin = xmin,
                xmax = xmax,
                ymin = ymin,
                ymax = ymax),
            color = "red") 
```

```{r ahull-demos, fig.width = 8, fig.height = 6}
set.seed(101)
ahull_dt <-
  data.table(x = c(0:100) / 100,
             y = cumsum(rnorm(101, 0.05)))
if (!identical(Sys.info()[["sysname"]], "Darwin"))
ggplot(ahull_dt) +
  geom_area(aes(x, y)) + 
  geom_rect(data = ahull(ahull_dt), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "red") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "blue") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE,
                         minH = 4), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "green") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE,
                         minW = 0.25), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "white",
            fill = NA)


```


## `hutils` v1.3.0

### `weighted_quantile`

Simply a version of `quantile` supporting weighted values:

```{r weighted_quantile-ex}
x <- 1:10
w <- c(rep(1, 5), rep(2, 5))
quantile(x, prob = c(0.25, 0.75), names = FALSE)

weighted_quantile(x, w, p = c(0.25, 0.75))
```


### `mutate_ntile`

To add a column of ntiles (say, for later summarizing):

```{r mutate_ntile-ex}
flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile(distance, n = 5L)

```

You can use non-standard evaluation (as above) or you can quote the `col` 
argument.  Use `character.only = TRUE` to ensure column is only interpreted
as character.

```{r mutate_ntile-ex-charonly}
flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile(distance, n = 5L)
```

```{r mutate_ntile-ex-2}
flights %>%
  as.data.table %>%
  mutate_ntile("distance",
               n = 5L,
               character.only = TRUE) %>%
  .[, dep_delay := coalesce(dep_delay, 0)] %>%
  .[, .(avgDelay = mean(dep_delay)), keyby = "distanceQuintile"]

```

### `longest_affix`

Trim common affixes can be useful during data cleaning:

```{r longest-affix}
trim_common_affixes(c("CurrentHousingCosts(weekly)",
                      "CurrentFuelCosts(weekly)"))
```

## `hutils 1.4.0`

### `%<->%`

Referred to as `swap` in the documentation. Used to swap values between object names

```{r swap}
a <- 1
b <- 2
a %<->% b
identical(c(a, b), c(2, 1))
```

### `average_bearing` 

Determine the average bearing of vectors. Slightly more difficult than simply the average
modulo 360 since its the most acute sector is desired.

```{r average-bearing}
average_bearing(0, 270)  # NW
mean(c(0, 270))          # SE (i.e. wrong)
```


### `dir2` 

This is a faster version of `list.files` for Windows only, utilizing the `dir` command 
on the command prompt.

### `Mode` 

The statistical mode; the most common element.

```{r Mode-eg}
Mode(c(1, 1, 1, 2, 3))
```

### `replace_pattern_in`

A cousin of `find_pattern_in`, but instead of collecting the results, it replaces the 
contents sought with the replacement provided.

### `samp`

A safer version of `sample`. I use it because I found the following behaviour of sample
surprising.

```{r samp-eg}
DT <- data.table(x = c(5, 2, 3),
                 y = c(5, 3, 4))
DT[, .(Base = sample(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
DT[, .(Base = samp(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
```