---
title: "Notifiable Disease Surveillance with SINAN"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Notifiable Disease Surveillance with SINAN}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

**SINAN** (Sistema de Informacao de Agravos de Notificacao) is Brazil's national notifiable disease surveillance system, managed by the Ministry of Health through DATASUS. Each record corresponds to one notification form for a disease subject to compulsory notification.

The `healthbR` package provides access to SINAN microdata from the DATASUS FTP server:

| Feature | Details |
|---------|---------|
| Coverage | National (one file per disease per year) |
| Diseases | 31 notifiable disease codes |
| Years | 2007--2024 (final + preliminary) |
| Unit | One row per notification record |
| Format | .dbc files, decompressed internally |

## Getting started

```{r setup}
library(healthbR)
library(dplyr)
```

### Check available years

```{r}
sinan_years()
#> [1] 2007 2008 2009 ... 2022

sinan_years(status = "all")
#> [1] 2007 2008 ... 2022 2023 2024
```

### Module information

```{r}
sinan_info()
```

## Exploring diseases

SINAN covers 31 notifiable diseases. Use `sinan_diseases()` to browse them:

```{r}
# all available diseases
sinan_diseases()

# search by name or code
sinan_diseases(search = "dengue")
sinan_diseases(search = "sifilis")
sinan_diseases(search = "tuberculose")
```

Common disease codes:

| Code | Disease |
|------|---------|
| DENG | Dengue |
| CHIK | Chikungunya |
| ZIKA | Zika |
| TUBE | Tuberculose |
| HANS | Hanseniase |
| HEPA | Hepatites virais |
| SIFA | Sifilis adquirida |
| SIFC | Sifilis congenita |
| LEPT | Leptospirose |
| MENI | Meningite |

## Downloading data

### Basic download (dengue, single year)

```{r}
# disease is spelled out here; the call without it relies on the default
dengue_2022 <- sinan_data(year = 2022, disease = "DENG")
dengue_2022
```

### Multiple years

```{r}
tb <- sinan_data(year = 2020:2022, disease = "TUBE")
tb
```

### Selecting variables

```{r}
# only key variables (faster and less memory)
dengue_key <- sinan_data(
  year = 2022,
  disease = "DENG",
  vars = c("DT_NOTIFIC", "CS_SEXO", "NU_IDADE_N",
           "CS_RACA", "ID_MUNICIP", "CLASSI_FIN")
)
```

### Exploring variables

```{r}
sinan_variables()
sinan_variables(search = "sexo")
sinan_variables(search = "municipio")
```

## Filtering by state

SINAN files are **national** (not per-state). To filter by geographic unit, use
the `SG_UF_NOT` (UF of notification) or `ID_MUNICIP` (municipality code)
columns after download:

```{r}
# filter by UF
dengue_sp <- sinan_data(year = 2022, disease = "DENG") |>
  filter(SG_UF_NOT == "35")  # 35 = Sao Paulo

# filter by municipality
dengue_rj_capital <- sinan_data(year = 2022, disease = "DENG") |>
  filter(ID_MUNICIP == "330455")  # Rio de Janeiro capital
```
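
Because the first two digits of an IBGE municipality code are the UF code, the state can also be derived from `ID_MUNICIP`. The sketch below illustrates this on a small made-up data frame rather than real SINAN output; `uf_lookup` is a hypothetical partial lookup covering only the three states used here.

```{r}
# toy notifications (made-up rows, real IBGE 6-digit municipality codes)
notifications <- data.frame(
  ID_MUNICIP = c("355030", "330455", "310620", "355030"),
  stringsAsFactors = FALSE
)

# the first two digits of the municipality code identify the UF
notifications$uf_code <- substr(notifications$ID_MUNICIP, 1, 2)

# hypothetical partial lookup: IBGE UF code -> abbreviation
uf_lookup <- c("31" = "MG", "33" = "RJ", "35" = "SP")
notifications$uf <- unname(uf_lookup[notifications$uf_code])

table(notifications$uf)
```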

## Key variables

| Variable | Description |
|----------|-------------|
| DT_NOTIFIC | Notification date |
| ID_AGRAVO | Disease code (ICD-10; CID-10 in Portuguese) |
| SG_UF_NOT | UF of notification (IBGE code) |
| ID_MUNICIP | Municipality of notification (IBGE 6 digits) |
| CS_SEXO | Sex (M/F/I) |
| NU_IDADE_N | Age (encoded: 1st digit = unit, digits 2-4 = value; e.g. 4025 = 25 years) |
| CS_RACA | Race/color (1=White, 2=Black, 3=Yellow, 4=Brown, 5=Indigenous) |
| CLASSI_FIN | Final classification (e.g. 1=Confirmed, 2=Discarded; codes vary by disease) |
| EVOLUCAO | Outcome (1=Cured, 2=Death by disease, 3=Death other causes) |
| CRITERIO | Confirmation criteria (1=Lab, 2=Clinical-epi) |
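
The encoded age is the least obvious of these fields. The following is a minimal sketch of a decoder, assuming the usual four-digit SINAN layout in which the first digit is the unit (1 = hours, 2 = days, 3 = months, 4 = years) and the remaining three digits are the value. `decode_sinan_age()` is a hypothetical helper written for this vignette, not a `healthbR` function.

```{r}
# hypothetical helper: convert NU_IDADE_N codes to age in years
decode_sinan_age <- function(x) {
  x <- as.integer(x)
  unit <- x %/% 1000L   # first digit: 1 = hours, 2 = days, 3 = months, 4 = years
  value <- x %% 1000L   # remaining three digits: the age in that unit
  unit[!(unit %in% 1:4)] <- NA_integer_   # unknown units become NA
  # how many of each unit make up one year
  per_year <- c(24 * 365.25, 365.25, 12, 1)
  value / per_year[unit]
}

decode_sinan_age(c("4025", "3006", "4102", "0000"))
```

Converting every unit to years keeps infants in the youngest age group instead of dropping them as missing.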

### Using the dictionary

```{r}
# all coded variables
sinan_dictionary()

# specific variable
sinan_dictionary("CS_SEXO")
sinan_dictionary("EVOLUCAO")
sinan_dictionary("CLASSI_FIN")
```

## Preliminary vs. final data

SINAN publishes both final (definitive) and preliminary data. By default,
`sinan_years()` returns only final years:

```{r}
# final data only (default)
sinan_years(status = "final")

# preliminary data
sinan_years(status = "preliminary")

# both
sinan_years(status = "all")
```

Preliminary data (2023--2024) may still be revised by the Ministry of Health.
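
When an analysis spans both kinds of years, it can help to carry the status along with the results. This is a minimal sketch with the year ranges hard-coded from the values shown in this vignette; in practice they would come from `sinan_years(status = ...)`.

```{r}
# year ranges copied from this vignette (assumption: they match sinan_years())
final_years <- 2007:2022
preliminary_years <- 2023:2024

requested <- 2021:2024

# label each requested year so downstream tables can flag revisable figures
year_status <- data.frame(
  year = requested,
  status = ifelse(requested %in% preliminary_years, "preliminary",
                  ifelse(requested %in% final_years, "final", NA_character_)),
  stringsAsFactors = FALSE
)

year_status
```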

## Example: confirmed dengue cases by month

```{r}
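# CLASSI_FIN codes are disease-specific. For dengue since 2014:
# 10 = dengue, 11 = dengue with warning signs, 12 = severe dengue, 5 = discarded.
# Check sinan_dictionary("CLASSI_FIN") before relying on these values.
dengue <- sinan_data(year = 2022, disease = "DENG") |>
  filter(CLASSI_FIN %in% c("10", "11", "12")) |>  # confirmed cases
  mutate(month = as.integer(format(DT_NOTIFIC, "%m")))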

cases_by_month <- dengue |>
  count(month) |>
  arrange(month)

cases_by_month
```
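
Counting only the months that occur in the data silently drops months with zero cases, which matters when plotting a time series. The sketch below uses made-up notification dates and base R to show one way of keeping all twelve months; it does not use any `healthbR` function.

```{r}
# made-up notification dates standing in for DT_NOTIFIC
dt_notific <- as.Date(c("2022-01-15", "2022-01-20", "2022-03-02", "2022-12-31"))

month <- as.integer(format(dt_notific, "%m"))

# a factor with fixed levels makes table() report empty months as 0
cases_by_month <- as.data.frame(
  table(month = factor(month, levels = 1:12)),
  responseName = "n"
)

cases_by_month
```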

## Example: tuberculosis by sex and age group

```{r}
tb <- sinan_data(year = 2022, disease = "TUBE")

# decode age: a first digit of 4 means the value (digits 2-4) is in years;
# records with age in hours, days, or months get NA here
tb_age <- tb |>
  filter(CLASSI_FIN == "1") |>
  mutate(
    age_unit = substr(NU_IDADE_N, 1, 1),
    age_value = as.integer(substr(NU_IDADE_N, 2, 4)),
    age_years = ifelse(age_unit == "4", age_value, NA_integer_),
    age_group = cut(age_years,
                    breaks = c(0, 15, 30, 45, 60, Inf),
                    labels = c("<15", "15-29", "30-44", "45-59", "60+"),
                    right = FALSE)
  )

tb_age |>
  filter(!is.na(age_group)) |>
  count(CS_SEXO, age_group) |>
  tidyr::pivot_wider(names_from = CS_SEXO, values_from = n)
```
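
The `right = FALSE` argument makes each interval closed on the left, so an age of exactly 15 falls in "15-29" rather than "<15". A small check on boundary values, using plain numbers instead of SINAN data, illustrates the grouping:

```{r}
# boundary ages chosen to show where each group starts and ends
ages <- c(0, 14, 15, 29, 30, 59, 60, 102)

age_group <- cut(ages,
                 breaks = c(0, 15, 30, 45, 60, Inf),
                 labels = c("<15", "15-29", "30-44", "45-59", "60+"),
                 right = FALSE)

data.frame(ages, age_group)
```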

## Example: incidence rate with Census denominators

Combine SINAN data with Census population to calculate incidence rates:

```{r}
# step 1: confirmed dengue by UF
# (dengue-specific codes: 10, 11, 12 = confirmed; 5 = discarded)
dengue_uf <- sinan_data(year = 2022, disease = "DENG") |>
  filter(CLASSI_FIN %in% c("10", "11", "12")) |>
  count(SG_UF_NOT, name = "cases")

# step 2: population from Census 2022
pop <- censo_populacao(year = 2022, territorial_level = "state")

# step 3: calculate incidence rate per 100,000
# incidence <- dengue_uf |>
#   left_join(pop, by = ...) |>
#   mutate(rate_100k = (cases / population) * 100000) |>
#   arrange(desc(rate_100k))
```
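
The join in step 3 depends on the column names returned by the Census function, which are not shown here. As a self-contained illustration of the arithmetic only, the sketch below uses made-up case counts and populations with hypothetical column names (`uf_code`, `population`); none of the numbers are real.

```{r}
# made-up case counts by UF (illustrative only)
cases <- data.frame(
  uf_code = c("33", "35", "31"),
  cases = c(1200, 4500, 800),
  stringsAsFactors = FALSE
)

# made-up populations with hypothetical column names
pop <- data.frame(
  uf_code = c("31", "33", "35"),
  population = c(2.0e7, 1.6e7, 4.4e7),
  stringsAsFactors = FALSE
)

# left join on the UF code, then cases per 100,000 inhabitants
incidence <- merge(cases, pop, by = "uf_code", all.x = TRUE)
incidence$rate_100k <- incidence$cases / incidence$population * 1e5
incidence <- incidence[order(-incidence$rate_100k), ]

incidence
```

Both key columns must have the same type before joining; a character UF code on one side and an integer on the other is a common source of unmatched rows.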

## Smart type parsing

By default, `sinan_data()` parses columns to appropriate types (dates, integers):

```{r}
# parsed types (default)
dengue <- sinan_data(year = 2022, disease = "DENG")
class(dengue$DT_NOTIFIC)  # Date
class(dengue$NU_ANO)      # integer

# raw character columns (backward-compatible)
dengue_raw <- sinan_data(year = 2022, disease = "DENG", parse = FALSE)

# override specific columns
dengue_custom <- sinan_data(
  year = 2022,
  col_types = list(DT_NOTIFIC = "character")
)
```
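
If you work with `parse = FALSE`, the conversions have to be done by hand. The sketch below shows the idea on a toy data frame, assuming the raw dates are ISO-formatted strings (`YYYY-MM-DD`); the actual raw format in the files may differ, so inspect a few values first.

```{r}
# toy stand-in for raw character output (assumed ISO date strings)
raw <- data.frame(
  DT_NOTIFIC = c("2022-01-15", "2022-03-02"),
  NU_ANO = c("2022", "2022"),
  stringsAsFactors = FALSE
)

# convert the columns needed for analysis
parsed <- transform(
  raw,
  DT_NOTIFIC = as.Date(DT_NOTIFIC, format = "%Y-%m-%d"),
  NU_ANO = as.integer(NU_ANO)
)

# confirm the resulting column classes
vapply(parsed, function(col) class(col)[1], character(1))
```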

## Cache management

Downloaded data is cached locally for faster future access:

```{r}
# check cache status
sinan_cache_status()

# clear cache if needed
sinan_clear_cache()
```

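If the `arrow` package is installed, data is cached in Parquet format for
faster loading. With `lazy = TRUE`, `sinan_data()` returns a lazy query, so
filters and column selections are applied before `collect()` reads the result
into memory: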

```{r}
# lazy query (requires arrow)
dengue_lazy <- sinan_data(year = 2022, disease = "DENG", lazy = TRUE)
dengue_lazy |>
  filter(CLASSI_FIN %in% c("10", "11", "12")) |>  # confirmed dengue (codes are disease-specific)
  select(DT_NOTIFIC, CS_SEXO, NU_IDADE_N, ID_MUNICIP) |>
  collect()
```

## Additional resources

- SINAN official page (`portalsinan.saude.gov.br`)
- [SIM vignette](datasus-modules.html) for mortality data
- [Census vignette](censo-denominadores.html) for population denominators
