---
title: "How caching works in tidywikidatar"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How caching works in tidywikidatar}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidywikidatar)
```

In order to reduce load on Wikidata's server and to speed up the processing of data, `tidywikidatar` makes extensive use of local caching.

## What data are cached locally

There are a few types of data that are cached locally:

- searches run with `tw_search()`
- data about an item, typically retrieved with `tw_get()` or `tw_get_property()`
- labels or description of properties, typically retrieved with `tw_get_property_label()` and `tw_get_property_description()`
- qualifiers of properties, typically retrieved with `tw_get_qualifiers()`
- data retrieved from (or about) Wikipedia pages, with `tw_get_wikipedia()`, and `tw_get_wikipedia_page_links()`

To reduce space used for local caching and speed up processing time, it is possible to store only labels and information available in a given language when relevant. 


## Caching with SQLite

In `tidywikidatar`, it is possible to enable caching with:

```{r}
tw_enable_cache()
```


If you do not include further parameters, by default `tidywikidatar` will use a local SQLite database for caching.

You can choose in which folder the SQLite database will be stored with `tw_set_cache_folder()`; if not already existing, you can create that folder with `tw_create_cache_folder()`.

```{r eval=FALSE}
tw_set_cache_folder(path = fs::path(
  fs::path_home_r(),
  "R",
  "tw_data"
))
tw_create_cache_folder()
```


## Caching with other database backends

Support for other database backends is now available. They can be accessed most easily using the following approach, having ensured that the relevant driver (and `odbc` package) have previously been installed:

```{r eval = FALSE}
tw_enable_cache(SQLite = FALSE)
tw_set_cache_db(
  driver = "MySQL",
  host = "localhost",
  port = 3306,
  database = "tidywikidatar",
  user = "secret_username",
  pwd = "secret_password"
)


# for testing, consider running a local database e.g. with:
# docker run --name tidywikidatar_db -p 3306:3306 -e MYSQL_ROOT_PASSWORD=secret_root_password -e MYSQL_USER=secret_username -e MYSQL_PASSWORD=secret_password -e MYSQL_DATABASE=tidywikidatar mysql:latest
```

It is also technically possible to pass directly a connection generated with `DBI::dbConnect()` to each function. 


## Name of tables in cached databases

Each database has a table for each language and type of content. For example, item information retrieved with `tw_get(id = "Q180099", language = "en")` will be stored in a table called `tw_item_en`.

The name of the table is unique and is generated by `tw_get_cache_table_name()`. For example:

```{r}
tw_get_cache_table_name(type = "item", language = "en")
```


## Column types and indexing

Due to limited familiarity with different database backends and limited time for testing, the creation of database tables is left to the default values of `DBI::dbWriteTable()`. For occasional use, this should not be an issue. However, when the local cache reaches millions rather than only thousands of rows, response time from a MySql database can take a few seconds, rather than a fraction of a second as would be expected. To deal with this, new functions for adding indexing to cache tables have been introduced, `tw_check_cache_index()`, `tw_index_cache_item()`, and `tw_index_cache_search()`. It is possible to apply this to all existing functions of a given type as outlined below. This speeds up retrieval time dramatically on MySql databases; impact on other types of databases has not been thoroughly tested.

```{r eval=FALSE}
db <- tw_connect_to_cache()


tables_v <- DBI::dbListTables(conn = db)

# for search cache tables
purrr::walk(
  .x = tables_v[stringr::str_starts(string = tables_v, "tw_search_item")],
  .f = function(x) {
    tw_index_cache_search(table_name = x)
  }
)

# for item cache tables
purrr::walk(
  .x = tables_v[stringr::str_starts(string = tables_v, "tw_item")],
  .f = function(x) {
    tw_index_cache_item(table_name = x)
  }
)
```