---
title: "Using bigKNN as Exact Ground Truth"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using bigKNN as Exact Ground Truth}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`bigKNN` is a natural exact backend for evaluating approximate nearest-neighbour
systems. It gives you two especially useful tools:

- exact top-`k` neighbours that act as trusted ground truth
- exact reranking of approximate candidate sets against the original
  `big.matrix` reference

This vignette stays package-independent on purpose. Instead of depending on a
specific ANN backend, it uses a small exact reference set plus a hand-built
candidate matrix that behaves like approximate output.

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

if (!requireNamespace("bigmemory", quietly = TRUE)) {
  cat("This vignette requires the 'bigmemory' package.\n")
  knitr::knit_exit()
}

library(bigKNN)
library(bigmemory)
```

```{r helpers, include=FALSE}
neighbor_table <- function(index, query_ids, ref_ids, distance = NULL) {
  do.call(rbind, lapply(seq_along(query_ids), function(i) {
    out <- data.frame(
      query = query_ids[i],
      rank = seq_len(ncol(index)),
      neighbor = ref_ids[index[i, ]],
      row.names = NULL
    )
    if (!is.null(distance)) {
      out$distance <- signif(distance[i, ], 5)
    }
    out
  }))
}
```

# Why exact ground truth matters

Approximate search is usually evaluated against an exact baseline. Without that
baseline, it is hard to answer basic questions:

- Are the returned neighbours actually correct?
- Is the approximate system missing true near neighbours, or just ranking them
  oddly?
- Does widening the ANN candidate pool improve top-`k` quality enough to justify
  the extra cost?

Exact search is also useful as a debugging tool. If an ANN system suddenly
behaves differently after a tuning change, exact neighbours give you a stable
reference point for comparing what changed.

# Build a small exact evaluation set

We will use eight reference rows and three queries. The example is deliberately
small, so both the exact neighbours and the approximate candidate sets stay
easy to inspect.

```{r create-data}
reference_points <- data.frame(
  id = paste0("r", 1:8),
  x1 = c(0.0, 0.2, 1.0, 1.2, 3.0, 3.2, 4.0, 4.2),
  x2 = c(0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 4.0, 4.1),
  x3 = c(0.5, 0.4, 1.2, 1.1, 2.8, 2.9, 3.8, 3.9)
)

query_points <- data.frame(
  id = paste0("q", 1:3),
  x1 = c(0.1, 1.1, 3.1),
  x2 = c(0.0, 1.0, 3.0),
  x3 = c(0.45, 1.15, 2.85)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2", "x3")]))
query_matrix <- as.matrix(query_points[c("x1", "x2", "x3")])

reference_points
query_points
```

# Producing exact neighbours with `bigKNN`

The ground-truth reference is simply the exact `k`-NN output for the same
reference, query set, metric, and `k` that you want to evaluate.

```{r exact-truth}
exact <- knn_bigmatrix(
  reference,
  query = query_matrix,
  k = 3,
  metric = "euclidean",
  exclude_self = FALSE
)

exact
neighbor_table(
  exact$index,
  query_ids = query_points$id,
  ref_ids = reference_points$id,
  distance = exact$distance
)
```

This object is the ground truth for the rest of the vignette. Every later
comparison asks how closely an approximate candidate set matches these exact top
three neighbours.
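
Before relying on it, the same exact top-3 indices can be reproduced with
nothing but base R. The chunk below is deliberately standalone (it re-enters
the same coordinates as `create-data`), so it acts as an independent
cross-check rather than part of the pipeline:

```{r exact-crosscheck}
# Same coordinates as above, entered directly so this chunk is self-contained.
ref <- cbind(
  x1 = c(0.0, 0.2, 1.0, 1.2, 3.0, 3.2, 4.0, 4.2),
  x2 = c(0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 4.0, 4.1),
  x3 = c(0.5, 0.4, 1.2, 1.1, 2.8, 2.9, 3.8, 3.9)
)
qry <- cbind(
  x1 = c(0.1, 1.1, 3.1),
  x2 = c(0.0, 1.0, 3.0),
  x3 = c(0.45, 1.15, 2.85)
)

# Euclidean distance from each query row to every reference row,
# then the indices of the three smallest distances per query.
top3 <- t(apply(qry, 1, function(q) {
  d <- sqrt(colSums((t(ref) - q)^2))
  order(d)[1:3]
}))
top3  # one row per query: q1, q2, q3
```

The row ordering of `top3` should agree with `exact$index` above; if it does
not, the metric definitions differ somewhere.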

# Measuring recall with `recall_against_exact()`

Suppose another package returns the following approximate top-3 neighbour index
matrix:

```{r approx-top3}
approx_top3 <- rbind(
  c(8, 3, 1),
  c(7, 4, 2),
  c(6, 2, 5)
)

neighbor_table(
  approx_top3,
  query_ids = query_points$id,
  ref_ids = reference_points$id
)
```

Each query still recovers some correct neighbours, but one true top-3 neighbour
is missing for every row. We can quantify that with `recall_against_exact()`:

```{r recall}
recall_before <- recall_against_exact(exact, approx_top3, k = 3)

recall_before
data.frame(
  query = query_points$id,
  recall = recall_before$per_query,
  row.names = NULL
)
```

Recall is set-based rather than rank-based:

- it checks whether the exact top-`k` ids appear in the approximate top-`k`
- it does not care whether those ids are in the same internal order
- `overall` is the mean of the per-query recalls

That makes recall a good first coverage metric, but not a full ranking metric.
An ANN system can have excellent recall while still misordering the neighbours
within the recovered set.
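
To make the definition concrete, here is a minimal base-R computation of
set-based recall for this example. It is a standalone sketch of the quantity
`recall_against_exact()` reports, not its implementation; the exact ids below
are the exact top-3 neighbours for this dataset:

```{r recall-sketch}
exact_top3 <- rbind(c(1, 2, 3), c(4, 3, 2), c(5, 6, 7))  # exact ids per query
approx3    <- rbind(c(8, 3, 1), c(7, 4, 2), c(6, 2, 5))  # approximate ids

# Per-query recall: fraction of exact top-k ids found in the approximate top-k.
# intersect() ignores order, which is exactly the set-based behaviour described.
per_query <- sapply(seq_len(nrow(exact_top3)), function(i) {
  length(intersect(exact_top3[i, ], approx3[i, ])) / ncol(exact_top3)
})
per_query        # 2/3 for every query
mean(per_query)  # overall recall
```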

# Reranking candidate sets with `rerank_candidates_bigmatrix()`

Reranking is useful when an approximate system returns a wider candidate pool
than the final `k` you care about. The idea is:

1. let the ANN stage generate a shortlist
2. compute exact distances only within that shortlist
3. sort the shortlist exactly and keep the best `top_k`

Here is a 5-candidate pool that contains the true exact top-3 ids for every
query, but not always in the first three slots:

```{r candidate-pool}
candidate_pool <- rbind(
  c(8, 3, 1, 2, 6),
  c(7, 4, 2, 3, 1),
  c(6, 2, 5, 7, 1)
)

neighbor_table(
  candidate_pool,
  query_ids = query_points$id,
  ref_ids = reference_points$id
)
```

Now rerank those candidates exactly against the reference matrix:

```{r rerank}
reranked <- rerank_candidates_bigmatrix(
  reference,
  query = query_matrix,
  candidate_index = candidate_pool,
  metric = "euclidean",
  top_k = 3
)

reranked
neighbor_table(
  reranked$index,
  query_ids = query_points$id,
  ref_ids = reference_points$id,
  distance = reranked$distance
)
```

The reranked result now matches the exact top-3 output, because the true
neighbours were present somewhere in the candidate pool.
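
The same result can be cross-checked by carrying out the three reranking steps
in standalone base R. This is an illustration of the idea, not the package's
implementation, and it re-enters the coordinates and candidate pool so it runs
on its own:

```{r rerank-crosscheck}
ref <- cbind(c(0.0, 0.2, 1.0, 1.2, 3.0, 3.2, 4.0, 4.2),
             c(0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 4.0, 4.1),
             c(0.5, 0.4, 1.2, 1.1, 2.8, 2.9, 3.8, 3.9))
qry <- cbind(c(0.1, 1.1, 3.1), c(0.0, 1.0, 3.0), c(0.45, 1.15, 2.85))
pool <- rbind(c(8, 3, 1, 2, 6),  # 5-wide shortlist from the ANN stage
              c(7, 4, 2, 3, 1),
              c(6, 2, 5, 7, 1))

top_k <- 3
reranked_idx <- t(sapply(seq_len(nrow(qry)), function(i) {
  cand <- pool[i, ]
  # exact Euclidean distances only within this query's shortlist
  d <- sqrt(colSums((t(ref[cand, ]) - qry[i, ])^2))
  cand[order(d)[1:top_k]]
}))
reranked_idx  # rows match the exact top-3: 1 2 3 / 4 3 2 / 5 6 7
```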

# Comparing approximate candidates against exact results

The usual evaluation pattern is:

1. compute exact neighbours once on an evaluation subset
2. compare ANN output to exact truth with `recall_against_exact()`
3. rerank a wider candidate pool exactly
4. compare recall again after reranking

```{r compare-before-after}
recall_after <- recall_against_exact(exact, reranked, k = 3)

data.frame(
  stage = c("Approximate top-3", "Reranked top-3 from 5 candidates"),
  overall_recall = c(recall_before$overall, recall_after$overall),
  row.names = NULL
)
```

In this example, the approximate top-3 recall is `2/3`, while reranking the
5-wide candidate pool recovers perfect top-3 recall.

# Suggested workflow alongside a separate ANN package

A practical package-independent workflow looks like this:

1. Choose an evaluation subset of queries and references.
2. Run `knn_bigmatrix()` once to compute exact top-`k` neighbours on that
   subset.
3. Run your ANN package with the same metric definition and collect its
   candidate indices as a matrix or `big.matrix`.
4. Measure ANN recall with `recall_against_exact()`.
5. If the ANN backend produces a wider candidate pool than final `k`, rerank it
   exactly with `rerank_candidates_bigmatrix()`.
6. Compare recall before and after reranking, and decide whether the candidate
   width is worth the cost.

This keeps responsibilities clean:

- the ANN package handles fast candidate generation
- `bigKNN` supplies exact truth and exact candidate reranking

# Interpreting recall and reranking results

There are two distinct questions to keep separate.

First: does the approximate stage cover the correct neighbours at all?

- low recall means true neighbours are missing from the approximate top-`k`
- widening the ANN candidate pool is often the first fix

Second: if the correct ids are present somewhere in the candidate pool, are they
ranked well enough for the final top-`k`?

- reranking can repair ordering within the supplied candidate set
- reranking can also improve top-`k` coverage when true neighbours are present
  just outside the ANN top-`k`

The wider the candidate pool, the better reranking can work, but the more exact
distance calculations it has to do.
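
In this example the trade-off is visible directly. Because the first three
columns of `candidate_pool` are exactly `approx_top3`, we can rerank only the
first `w` candidates of each row and watch top-3 recall improve as `w` grows
(standalone base R, same coordinates and pool as above):

```{r pool-width-recall}
ref <- cbind(c(0.0, 0.2, 1.0, 1.2, 3.0, 3.2, 4.0, 4.2),
             c(0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 4.0, 4.1),
             c(0.5, 0.4, 1.2, 1.1, 2.8, 2.9, 3.8, 3.9))
qry <- cbind(c(0.1, 1.1, 3.1), c(0.0, 1.0, 3.0), c(0.45, 1.15, 2.85))
pool <- rbind(c(8, 3, 1, 2, 6),
              c(7, 4, 2, 3, 1),
              c(6, 2, 5, 7, 1))
exact_top3 <- rbind(c(1, 2, 3), c(4, 3, 2), c(5, 6, 7))

# For each width w, rerank the first w candidates exactly and measure
# top-3 recall against the exact ids.
recall_at_width <- sapply(3:5, function(w) {
  mean(sapply(seq_len(nrow(qry)), function(i) {
    cand <- pool[i, seq_len(w)]
    d <- sqrt(colSums((t(ref[cand, ]) - qry[i, ])^2))
    kept <- cand[order(d)[1:3]]
    length(intersect(exact_top3[i, ], kept)) / 3
  }))
})
names(recall_at_width) <- paste0("w=", 3:5)
recall_at_width  # rises from 2/3 at w = 3 to 1 at w = 4 and w = 5
```

Here a single extra candidate per query is enough; real workloads typically
need wider pools, and the right width is an empirical question.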

# Limits and caveats

Reranking improves ordering only within the candidates you provide. It cannot
invent missing neighbours. If the candidate pool is already too narrow, recall
may stay capped even after exact reranking:

```{r rerank-limit}
reranked_limited <- rerank_candidates_bigmatrix(
  reference,
  query = query_matrix,
  candidate_index = approx_top3,
  metric = "euclidean",
  top_k = 3
)

recall_after_limited <- recall_against_exact(exact, reranked_limited, k = 3)

data.frame(
  stage = c("Approximate top-3", "Reranked top-3 from same 3 candidates"),
  overall_recall = c(recall_before$overall, recall_after_limited$overall),
  row.names = NULL
)
```

Other important caveats:

- exact search still has a real computational cost, so evaluation often uses a
  representative subset rather than the entire production workload
- exact and approximate systems must use the same metric definition if recall is
  to mean anything
- `recall_against_exact()` measures set overlap, not rank quality
- sparse query inputs are supported in `bigKNN`, but currently densified on the
  R side before exact computation

Used this way, `bigKNN` becomes a clean exact companion to approximate search:
it tells you what the right neighbours are, how often the ANN system finds
them, and whether a wider candidate pool can be turned into a better final
result by exact reranking.
