---
title: "JSON output vs. schema-validated output in LLMR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{JSON output vs. schema-validated output in LLMR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true") )
```  


## TL;DR

- **JSON mode**: ask the model for “a JSON object.” Lower friction. Weak guarantees.  
- **Schema output**: give a JSON Schema and request strict validation. Higher reliability *when the provider enforces it*.  
- **Reality**: enforcement and request shapes differ across providers. Use **defensive parsing** and **local validation**.

---

## What the major providers actually support

- **OpenAI-compatible (OpenAI, Groq, Together, x.ai, DeepSeek)**  
  The Chat Completions endpoint accepts a `response_format` (e.g., `{"type":"json_object"}` or a JSON Schema payload). Enforcement varies by provider, but the interface is OpenAI-shaped.  
  See [OpenAI: Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs), [Groq: Structured Outputs](https://console.groq.com/docs/structured-outputs), [Together: JSON mode](https://docs.together.ai/docs/json-mode), [x.ai: Structured Outputs](https://docs.x.ai/docs/guides/structured-outputs), [DeepSeek: JSON mode](https://api-docs.deepseek.com/guides/json_mode).

- **Anthropic (Claude)**  
  No global “JSON mode.” Instead, you **define a tool** with an **`input_schema`** (JSON Schema) and **force** it via `tool_choice`, so the model must return a JSON object that validates against the schema.  
  See [Anthropic Messages API: tools & `input_schema`](https://docs.claude.com/en/api/messages#tools)

- **Google Gemini (REST)**  
  Set `responseMimeType = "application/json"` in `generationConfig` to request JSON. Some models also accept **`responseSchema`** for constrained JSON (model-dependent).  
  See [Gemini documentation](https://ai.google.dev/gemini-api/docs/)

---

## Why prefer schema output?

- **Deterministic downstream code**: predictable keys/types enable typed transforms.  
- **Safer integrations**: strict mode avoids extra keys, missing fields, or textual preambles.  
- **Faster failure**: invalid generations fail early, where retry/backoff is easy to manage.
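
To make "strict" concrete, here is a minimal sketch of the same checks run locally. It assumes the `jsonvalidate` package is installed (its `"ajv"` engine depends on V8); the schema and the candidate replies are made up for illustration and do not come from any provider.

```{r}
# Illustrative schema: two required fields, no extra keys allowed
schema_json <- '{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "title": {"type": "string"},
    "year":  {"type": "integer"}
  },
  "required": ["title", "year"]
}'

# Made-up replies, one conforming and three that break a different rule each
candidates <- c(
  ok            = '{"title": "BERT", "year": 2018}',
  wrong_type    = '{"title": "BERT", "year": "2018"}',
  extra_key     = '{"title": "BERT", "year": 2018, "note": "hi"}',
  missing_field = '{"title": "BERT"}'
)

if (requireNamespace("jsonvalidate", quietly = TRUE)) {
  # json_validator() compiles the schema once and returns a checking function
  validate <- jsonvalidate::json_validator(schema_json, engine = "ajv")
  results  <- vapply(candidates, validate, logical(1))
  print(results)
}
```

Only the first candidate should pass. The other three illustrate the failure classes listed above: a wrong type, an extra key, and a missing field.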

## Why JSON-only still matters

- **Broadest support** across models/providers/proxies.  
- **Low ceremony** for exploration, labeling, and quick prototypes.

---

## Quirks you will hit in practice

- Models often wrap JSON in **code fences** or add pre/post text.  
- Arrays/objects appear where you expected scalars; **ints vs doubles** vary by provider/sample.  
- **Safety/length caps** can truncate output; detect and handle “finish_reason = length/filter.”  
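
The first and third quirks can be reproduced without any API call. The sketch below is a hand-rolled illustration in base R, not LLMR's implementation: `extract_json()` is a hypothetical helper that returns the first balanced `{...}` or `[...]` span in a reply and `NULL` when the span never closes, which is what truncated output typically looks like.

```{r}
# Hypothetical helper: first balanced {...} or [...] span, or NULL
extract_json <- function(txt) {
  chars <- strsplit(txt, "", fixed = TRUE)[[1]]
  start <- which(chars %in% c("{", "["))[1]
  if (is.na(start)) return(NULL)
  depth <- 0L; in_string <- FALSE; escaped <- FALSE
  for (i in seq(start, length(chars))) {
    ch <- chars[i]
    if (in_string) {
      # Inside a JSON string, brackets do not count; track escapes
      if (escaped) escaped <- FALSE
      else if (ch == "\\") escaped <- TRUE
      else if (ch == "\"") in_string <- FALSE
    } else if (ch == "\"") {
      in_string <- TRUE
    } else if (ch %in% c("{", "[")) {
      depth <- depth + 1L
    } else if (ch %in% c("}", "]")) {
      depth <- depth - 1L
      if (depth == 0L) return(paste(chars[start:i], collapse = ""))
    }
  }
  NULL  # never balanced: the reply was probably cut off
}

fence <- strrep("`", 3)  # built at run time to avoid a literal fence in this chunk
replies <- c(
  paste0(fence, 'json\n{"x": 1, "y": [1,2,3]}\n', fence),
  'Sure! Here is JSON: {"x":"1","y":"oops"} trailing words',
  '{"x": 1, "note": "cut off'
)
lapply(replies, extract_json)
```

The sketch counts both bracket types in one depth counter and takes the first opening bracket it sees, so prose such as `[note]` ahead of the payload would mislead it. A production parser needs more care than this.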

### LLMR helpers to blunt those edges

- `llm_parse_structured()` strips fences and extracts the **largest balanced** `{...}` or `[...]` before parsing.  
- `llm_parse_structured_col()` hoists fields (supports dot/bracket paths and JSON Pointer) and keeps non-scalars as list-columns.  
- `llm_validate_structured_col()` validates locally via **jsonvalidate (AJV)**.  
- `enable_structured_output()` flips the right provider switch (OpenAI-compat `response_format`, Anthropic **tool + `input_schema`**, Gemini `responseMimeType`/`responseSchema`).
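
For orientation, the three "switches" correspond roughly to the request fragments below, written as R lists. This is a sketch based on the providers' public documentation, not a dump of what LLMR sends; the schema and the `description` text are illustrative, and the tool name `llmr_schema` is simply the one used later in this vignette.

```{r}
# Illustrative schema shared by the OpenAI-style and Anthropic fragments
schema <- list(
  type = "object",
  additionalProperties = FALSE,
  properties = list(answer = list(type = "string")),
  required = list("answer")
)

# OpenAI-compatible Chat Completions: schema nested under response_format
openai_part <- list(
  response_format = list(
    type = "json_schema",
    json_schema = list(name = "llmr_schema", strict = TRUE, schema = schema)
  )
)

# Anthropic Messages: the schema is a tool's input_schema; tool_choice forces it
anthropic_part <- list(
  tools = list(list(
    name = "llmr_schema",
    description = "Return the answer.",
    input_schema = schema
  )),
  tool_choice = list(type = "tool", name = "llmr_schema")
)

# Gemini: MIME type plus an optional schema inside generationConfig
gemini_part <- list(
  generationConfig = list(
    responseMimeType = "application/json",
    responseSchema = list(
      type = "OBJECT",
      properties = list(answer = list(type = "STRING")),
      required = list("answer")
    )
  )
)

# Print one fragment as JSON if jsonlite is available
if (requireNamespace("jsonlite", quietly = TRUE)) {
  cat(jsonlite::toJSON(anthropic_part, auto_unbox = TRUE, pretty = TRUE), "\n")
}
```

The same schema therefore travels in three different places, which is the main reason to let one helper set it.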

---

## Minimal patterns (guarded code)

The LLMR chunks below are wrapped in a tiny helper that turns errors into messages, so your document **knits even without API keys**.

```{r}
safe <- function(expr) tryCatch(expr, error = function(e) {message("ERROR: ", e$message); NULL})
```

### 1) JSON mode, no schema (works across OpenAI-compatible providers)

```{r}
safe({
  library(LLMR)
  cfg <- llm_config(
    provider = "openai",                # try "groq" or "together" too
    model    = "gpt-4.1-nano",
    temperature = 0
  )

  # Flip JSON mode on (OpenAI-compat shape)
  cfg_json <- enable_structured_output(cfg, schema = NULL)

  res    <- call_llm(cfg_json, 'Give me a JSON object {"ok": true, "n": 3}.')
  parsed <- llm_parse_structured(res)

  cat("Raw text:\n", as.character(res), "\n\n")
  str(parsed)
})
```

**What could still fail?** Proxies labeled “OpenAI-compatible” sometimes accept `response_format` but don’t strictly enforce it; LLMR’s parser recovers from fences or pre/post text.

---

### 2) **Schema mode that actually works** (Groq + Qwen, *open-weights / non-commercial friendly*)

Groq serves Qwen 2.5 Instruct models with OpenAI-compatible APIs. Their **Structured Outputs** feature enforces JSON Schema and (notably) expects **all properties to be listed under `required`**.

```{r}
safe({
  library(LLMR); library(dplyr)

  # Schema: make every property required to satisfy Groq's stricter check
  schema <- list(
    type = "object",
    additionalProperties = FALSE,
    properties = list(
      title = list(type = "string"),
      year  = list(type = "integer"),
      tags  = list(type = "array", items = list(type = "string"))
    ),
    required = list("title","year","tags")
  )

  cfg <- llm_config(
    provider = "groq",
    model    = "qwen-2.5-72b-instruct",   # a Qwen Instruct model on Groq
    temperature = 0
  )
  cfg_strict <- enable_structured_output(cfg, schema = schema, strict = TRUE)

  df  <- tibble(x = c("BERT paper", "Vision Transformers"))
  out <- llm_fn_structured(
    df,
    prompt   = "Return JSON about '{x}' with fields title, year, tags.",
    .config  = cfg_strict,
    .schema  = schema,          # send schema to provider
    .fields  = c("title","year","tags"),
    .validate_local = TRUE
  )

  out %>% select(structured_ok, structured_valid, title, year, tags) %>% print(n = Inf)
})
```

If your key is set, you should see `structured_ok = TRUE`, `structured_valid = TRUE`, plus parsed columns.

**Common gotcha**: If Groq returns a 400 error complaining about `required`, ensure **all properties** are listed in the `required` array. Groq's structured output implementation is stricter than OpenAI's.
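
A quick local check can catch this before the request goes out. The sketch below uses a hypothetical helper, `missing_required()`, that walks a schema written as an R list and reports every property not named under `required`. It assumes the rule applies at every nesting level, which is stricter than what is stated above.

```{r}
# Hypothetical helper: paths of properties that are not listed in `required`
missing_required <- function(schema, path = "") {
  out <- character(0)
  if (identical(schema$type, "object")) {
    props  <- names(schema$properties)
    absent <- setdiff(props, unlist(schema$required))
    if (length(absent)) out <- c(out, paste0(path, "/", absent))
    # Recurse into each property in case it is itself an object or array
    for (p in props) {
      out <- c(out, missing_required(schema$properties[[p]], paste0(path, "/", p)))
    }
  } else if (identical(schema$type, "array")) {
    out <- c(out, missing_required(schema$items, paste0(path, "[]")))
  }
  out
}

# Illustrative schema with two omissions: `authors` and the nested `orcid`
draft <- list(
  type = "object",
  properties = list(
    title   = list(type = "string"),
    year    = list(type = "integer"),
    authors = list(
      type  = "array",
      items = list(
        type = "object",
        properties = list(name = list(type = "string"), orcid = list(type = "string")),
        required = list("name")
      )
    )
  ),
  required = list("title", "year")
)

missing_required(draft)
```

An empty result means every property is covered.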

---

### 3) Anthropic: force a schema via a tool (requires `max_tokens`)

```{r}
safe({
  library(LLMR)
  schema <- list(
    type="object",
    properties=list(answer=list(type="string"), confidence=list(type="number")),
    required=list("answer","confidence"),
    additionalProperties=FALSE
  )

  cfg <- llm_config("anthropic","claude-3-5-haiku-latest", temperature = 0)
  cfg <- enable_structured_output(cfg, schema = schema, name = "llmr_schema")

  res <- call_llm(cfg, c(
    system = "Return only the tool result that matches the schema.",
    user   = "Answer: capital of Japan; include confidence in [0,1]."
  ))

  parsed <- llm_parse_structured(res)
  str(parsed)
})
```

> Anthropic *requires* `max_tokens`; LLMR warns and defaults if you omit it.

---

### 4) Gemini: JSON response (plus optional response schema on supported models)

```{r}
safe({
  library(LLMR)

  cfg <- llm_config(
    "gemini", "gemini-2.5-flash-lite",
    response_mime_type = "application/json"  # ask for JSON back
    # Optionally: gemini_enable_response_schema = TRUE, response_schema = <your JSON Schema>
  )

  res <- call_llm(cfg, c(
    system = "Reply as JSON only.",
    user   = "Produce fields name and score about 'MNIST'."
  ))
  str(llm_parse_structured(res))
})
```

---

## Defensive patterns (no API calls)

````{r}
safe({
  library(LLMR); library(tibble)

  messy <- c(
    '```json\n{"x": 1, "y": [1,2,3]}\n```',
    'Sure! Here is JSON: {"x":"1","y":"oops"} trailing words',
    '{"x":1, "y":[2,3,4]}'
  )

  tibble(response_text = messy) |>
    llm_parse_structured_col(
      fields = c(x = "x", y = "/y/0")   # dot/bracket or JSON Pointer
    ) |>
    print(n = Inf)
})
````

**Why this helps:** parsing still works when outputs arrive fenced, with pre/post text, or when arrays sneak in. Non-scalars become list-columns (set `allow_list = FALSE` to force scalars only).
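
The path `"/y/0"` above is a JSON Pointer (RFC 6901): each `/`-separated token selects an object key or a zero-based array index. The sketch below is a hypothetical base-R resolver over an already parsed list, written only to show the lookup rule. It is not how LLMR resolves paths, and it ignores edge cases such as empty keys.

```{r}
# Hypothetical resolver: follow a JSON Pointer through nested R lists
resolve_pointer <- function(x, pointer) {
  if (identical(pointer, "")) return(x)        # empty pointer = whole document
  tokens <- strsplit(sub("^/", "", pointer), "/", fixed = TRUE)[[1]]
  # RFC 6901 escapes: "~1" stands for "/", then "~0" stands for "~"
  tokens <- gsub("~0", "~", gsub("~1", "/", tokens, fixed = TRUE), fixed = TRUE)
  for (tok in tokens) {
    if (!is.list(x)) return(NULL)              # cannot descend into a scalar
    if (!is.null(names(x))) {                  # named list = JSON object
      if (!tok %in% names(x)) return(NULL)
      x <- x[[tok]]
    } else {                                   # unnamed list = JSON array
      idx <- suppressWarnings(as.integer(tok)) + 1L   # JSON is 0-based
      if (is.na(idx) || idx < 1L || idx > length(x)) return(NULL)
      x <- x[[idx]]
    }
  }
  x
}

good <- list(x = 1, y = list(2, 3, 4))     # like {"x":1,"y":[2,3,4]}
bad  <- list(x = "1", y = "oops")          # like {"x":"1","y":"oops"}

resolve_pointer(good, "/y/0")   # first element of y
resolve_pointer(bad,  "/y/0")   # y is a scalar, so nothing to index
```

When the model returns a scalar where an array was expected, the lookup yields nothing rather than raising an error, which is the behaviour a column-wise parser needs.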

---

## Pro tip: Combine with parallel execution

For production ETL workflows, combine schema validation with parallelization:

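```{r, eval=FALSE}
library(LLMR); library(dplyr)

# Schema for the two fields extracted below
schema <- list(
  type = "object",
  additionalProperties = FALSE,
  properties = list(
    label = list(type = "string"),
    score = list(type = "number")
  ),
  required = list("label", "score")
)

cfg <- llm_config("openai", "gpt-4.1-nano")
cfg_with_schema <- enable_structured_output(cfg, schema = schema)

setup_llm_parallel(workers = 10)

# Assumes a large data frame `large_df` with a `text` column
out <- large_df |>
  llm_mutate_structured(
    result,
    prompt  = "Extract: {text}",
    .config = cfg_with_schema,
    .schema = schema,
    .fields = c("label", "score"),
    tries   = 3  # auto-retry failures
  )

reset_llm_parallel()
```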

This processes thousands of rows efficiently with automatic retries and validation.
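
The retry idea itself is simple enough to sketch without LLMR. Below, `with_retries()` is a hypothetical wrapper (not the mechanism behind `tries`) that calls a function until a validity check passes, doubling the wait between attempts. The "model" is a stand-in that returns two malformed replies before a complete one.

```{r}
# Hypothetical wrapper: retry until `is_valid(result)` holds, with backoff
with_retries <- function(fn, is_valid, tries = 3, base_delay = 0.5) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(fn(), error = function(e) NULL)
    if (!is.null(result) && isTRUE(is_valid(result))) {
      return(list(value = result, attempts = attempt))
    }
    # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    if (attempt < tries) Sys.sleep(base_delay * 2^(attempt - 1))
  }
  list(value = NULL, attempts = tries)
}

# Stand-in for a model call: malformed twice, then a complete JSON object
fake_replies <- c("not json", '{"label": "spam"', '{"label": "spam", "score": 0.9}')
call_count <- 0
fake_call <- function() {
  call_count <<- call_count + 1
  fake_replies[call_count]
}
# Crude validity check for the demo: text starts with "{" and ends with "}"
looks_complete <- function(txt) grepl("^\\{.*\\}$", txt)

res <- with_retries(fake_call, looks_complete, tries = 3, base_delay = 0.01)
str(res)
```

In real use the validity check would be a schema validator, and the delay would be long enough to respect provider rate limits.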

---

## Choosing the mode

* **Reporting / ETL / metrics:** Schema mode; fail fast and retry.
* **Exploration / ad-hoc:** JSON mode + recovery parser.
* **Cross-provider code:** Always wrap provider toggles with `enable_structured_output()` and run `llm_parse_structured()` + local validation.

---

## References

* OpenAI: Structured Outputs: [https://platform.openai.com/docs/guides/structured-outputs](https://platform.openai.com/docs/guides/structured-outputs)
* Groq: Structured Outputs: [https://console.groq.com/docs/structured-outputs](https://console.groq.com/docs/structured-outputs)
* Together: Structured Output: [https://docs.together.ai/docs/json-mode](https://docs.together.ai/docs/json-mode)
* x.ai: Structured Output: [https://docs.x.ai/docs/guides/structured-outputs](https://docs.x.ai/docs/guides/structured-outputs)
* DeepSeek: JSON Mode: [https://api-docs.deepseek.com/guides/json_mode](https://api-docs.deepseek.com/guides/json_mode)
* Anthropic: Messages API, tools & `input_schema`: [https://docs.claude.com/en/api/messages#body-tool-choice](https://docs.claude.com/en/api/messages#body-tool-choice)
* Google Gemini: Structured Output: [https://ai.google.dev/gemini-api/docs/structured-output](https://ai.google.dev/gemini-api/docs/structured-output)

