---
title: "Introduction to funneljoin"
author: "Emily Robinson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to funneljoin}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  message = FALSE
)
```

The goal of funneljoin is to make it easy to analyze behavior funnels. For example, maybe you're interested in finding the people who visit a page and then register. Or you want all the times people click on an item and add it to their cart within 2 days. 

You can do this with funneljoin's `after_join()` function. The arguments are: 

  - `x`: a dataset with the first set of behaviors.
  - `y`: a dataset with the second set of behaviors.
  - `by_time`: a character vector to specify the time columns in x and y. Must be a single column in each tbl. Note that this column is used to filter for time y >= time x.
  - `by_user`: a character vector to specify the user or identity columns in x and y. Must be a single column in each tbl.
  - `mode`: the method used to join: "inner", "full", "anti", "semi", "right", "left".
  - `type`: the type of funnel used to distinguish between event pairs, such as "first-first", "last-first", "any-firstafter". See types of funnels.
  - `max_gap` (optional): the maximum gap between events. Can be a integer representing the number of seconds or a difftime object
  
## after_join types

Funneljoins can be any combination of `first`, `last`, `any`, and `lastbefore` with `first`, `last`,  `any`, and `firstafter`.

Let's take an example. We'll use two tables, one of landings and one of registrations, that come with the package. Each has a `user_id` and `timestamp` column. 

```{r}
library(dplyr)
library(funneljoin)
```

```{r}
landed
```

```{r}
registered
```

Let's say we wanted to get only the first time people landed and the first time they registered, *if* it was after their first landing. We would use a `first-first inner` join. 

```{r}
landed %>%
  after_inner_join(registered, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-first")
```

We now have a dataset of four rows. Notice a few things: 

  - `timestamp.y` is always greater than or equal to `timestamp.x`. This is a condition of `after_join`.
  - Because it's an `inner` join, we only have users who were in both tables and had a `timestamp.y` that was greater than a `timestamp.x`. 
  - We only have one entry per user, as we made each table distinct by user, taking only their first landing or registration, before joining.
  - Although user 4 had a registration after a landing ("2018-07-01" landing and "2018-07-02" registration), they are not in the table because their **first** registration was on "2018-06-10", so we only checked if that happened after their first landing. 

## Max gap 

What if instead we wanted all landing-registration pairs that happened within a 4-day window? We now add the `max_gap` argument. This takes either a number, which is the number of seconds, or a difftime object. Instead of calculating the number of seconds in 4 days, we'll just make a difftime object. Because we want any pair, not just the pairs of the first landings and registrations, we change the type to `any-any`.

We'll also add `gap_col = TRUE` to return a column, `.gap`, which is the gap in seconds between the events.

```{r}
landed %>%
  after_inner_join(registered, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "any-any", 
                   max_gap = as.difftime(4, units = "days"),
                   gap_col = TRUE)
```

We now have a dataset of 7 rows. Notice a few things: 

  - As always, `timestamp.y` is greater than or equal to `timestamp.x`. 
  - User 6 appears three times, as they have three landing-registration pairs where the time difference is less than four days.
  - Just as before, because it's an `inner` join, we only have users who were in both tables and had a `timestamp.y` that was greater than a `timestamp.x`. 

## Other types of joins

Left joins are very handy when you want to know what percentage of people who did X did Y afterward. For example, we may have an experiment where we want to know "what percentage of people registered after entering the experiment?"

In this case, we'd use a `first-firstafter` left join - we want to get a person's first experiment start and then their first registration afterward, if they have one. Let's add another column to our experiment starts table, which experiment variant someone was in. 

```{r}
experiment_starts <- tibble::tribble(
  ~user_id, ~timestamp, ~ alternative.name,
  1, "2018-07-01", "control",
  2, "2018-07-01", "treatment",
  3, "2018-07-02", "control",
  4, "2018-07-01", "control",
  4, "2018-07-04", "control",
  5, "2018-07-10", "treatment",
  5, "2018-07-12", "treatment",
  6, "2018-07-07", "treatment",
  6, "2018-07-08", "treatment"
) %>%
  mutate(timestamp = as.Date(timestamp))

experiment_registrations <- tibble::tribble(
  ~user_id, ~timestamp, 
  1, "2018-07-02", 
  3, "2018-07-02", 
  4, "2018-06-10", 
  4, "2018-07-02", 
  5, "2018-07-11", 
  6, "2018-07-10", 
  6, "2018-07-11", 
  7, "2018-07-07"
) %>%
  mutate(timestamp = as.Date(timestamp))
```

```{r}
experiment_starts %>%
  after_left_join(experiment_registrations, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-firstafter")
```

We now have an entry for each user in the experiment_starts table, which has the time of their first registration afterward or NA if they did not have a registration afterwards. Each user only appears once since it's a `first-something` join.   

We can use funneljoin's `summarize_conversions()` function to get the number of starts and number of conversions (registrations) in each group - you just need to specify the column that indicates whether someone converted - if it's NA or FALSE, it will be treated as FALSE, otherwise TRUE. 

```{r}
experiment_starts %>%
  after_left_join(experiment_registrations, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-firstafter") %>% 
  group_by(alternative.name) %>%
  summarize_conversions(converted = timestamp.y)
```

`summarize_conversions` also works if we have multiple types of conversions, specified by a column.

```{r}
for_conversion <- tibble::tribble(
  ~"experiment_group", ~"first_event", ~"last_event", ~"type", 
  "control", "2018-07-01", NA, "click",
  "control", "2018-07-02", NA, "click",
  "control", "2018-07-03", "2018-07-05", "click",
  "treatment", "2018-07-01", "2018-07-05", "click",
  "treatment", "2018-07-01", "2018-07-05", "click",
  "control", "2018-07-01", NA, "purchase",
  "control", "2018-07-02", NA, "purchase",
  "control", "2018-07-03", NA, "purchase",
  "treatment", "2018-07-01", NA, "purchase",
  "treatment", "2018-07-01", "2018-07-05", "purchase"
)

for_conversion %>%
  group_by(type, experiment_group) %>%
  summarize_conversions(converted = last_event)
```

## Summarize Prop Tests

`summarize_prop_tests()` takes in a dataset with at least three columns - `nb_users`, `nb_conversions`, and a column indicating experiment group. It can also have an additional column that is the type of conversion - for example, you could have clicks and purchases. Each type of conversion can only have two rows, one `control` and one other group. If you have that additional column of type, you need to group by it first. 

It returns a dataset with at least 5 columns:

  - `control`: the conversion rate of the control group
  - `treatment`: the  conversion rate of the treatment group
  - `p_value` of the proportion test
  - `pct_change`: the percentage difference between the control and treatment group
  - `pct_change_low` and `pct_change_high`: the bayesian estimates for a 90% confidence interval. 
  
If you had a type column, it will also be in the output. 

```{r}
tbl <- tibble::tribble(
  ~ experiment_group, ~nb_users, ~nb_conversions, ~type,
  "control", 500, 200, "purchase",
  "treatment", 500, 100, "purchase", 
  "control", 500, 360, "click",
  "treatment", 500, 375, "click"
)

tbl %>%
  group_by(type) %>%
  summarize_prop_tests(alternative_name = experiment_group)
```
