---
title: "Data Cleaning: From KOSIS Raw Data to Tidy Format"
output: rmarkdown::html_vignette
date: "`r format(Sys.Date(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{Data Cleaning: From KOSIS Raw Data to Tidy Format} 
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  eval = FALSE,
  # Set to FALSE since this requires API keys and large data downloads
  out.width = "100%"
)
```

```r
library(tidycensuskr)
library(dplyr)
library(tidyr)
library(plyr)
library(kosis)

```

## Introduction

This vignette demonstrates how to clean and transform raw Korean census data from the Korean Statistical Information Service (KOSIS) API into a tidy, analysis-ready format. While there are existing R packages for accessing Korean statistical data (such as the `kosis` package for general KOSIS data access), **there are currently no readily available software solutions specifically designed for retrieving and processing Korean census data in a tidy, analysis-friendly format** for demographic and socioeconomic research.

### The Challenge

Korean census and administrative data presents several challenges for researchers:

1. **Complex API URLs**: KOSIS APIs require detailed parameter specifications with cryptic codes
2. **Multiple data sources**: Different datasets (population, tax, mortality) come from different API endpoints
3. **Inconsistent formatting**: Raw data often comes in wide format with Korean-language column names
4. **Administrative code mapping**: Different datasets use different administrative code systems
5. **Data integration**: Combining multiple datasets requires careful matching of administrative units

This vignette shows how the `tidycensuskr` package addresses these challenges by providing pre-processed, tidy datasets and demonstrating the underlying data cleaning workflow.

## Raw Data Sources

The data cleaning process involves retrieving data from multiple KOSIS API endpoints:

You need to know following information to retrieve data from KOSIS API:

- **API Key**: You need to register for an API key from KOSIS.
- **API Parameters**: Each API endpoint requires specific parameters such as `itmId`, `objL1`, `objL2`, etc. These parameters define the data you want to retrieve.
- **Output Format**: The API supports various output formats, but we will use JSON for easier processing in R.


### API call parameters

| Parameter | Type | Description | Required? |
|-----------|------|-------------|-----|
| `apiKey` | String | Your KOSIS API key | Yes |
| `orgId` | String | Organization identifier | Yes |
| `tblId` | String | Table identifier | Yes |
| `objL1` | String | First class code | Yes |
| `objL2`-`objL8` | String | Second to eighth class code | Optional |
| `itmId` | String | Item identifier(s) | Yes |
| `prdSe` | String | Update cycle | Yes |
| `format` | String | Output format (e.g., JSON) | Yes |


### Data variables

To note, an output can include the following variables:

| Field name (all caps) | Description | Data type | Remarks |
|-----|--------------|-------|------|
| ORG_ID | Organization code | VARCHAR2(40) | Yes |
| TBL_ID | Table identifier | VARCHAR2(40) |  |
| TBL_NM | Table name | VARCHAR2(300) |  |
| C1 - C8 | Class identifier (1-8) | VARCHAR2(40) | 2-8 may be omitted if nonexisting |
| C1_OBJ_NM - C8_OBJ_NM | Class code (1-8) | VARCHAR2(3000) |  |
| C1_OBJ_NM_ENG - C8_OBJ_NM_ENG | Class code in English (1-8) | VARCHAR2(3000) |  |
| C1_NM - C8_NM | Class name (1-8) | VARCHAR2(3000) |  |
| C1_NM_ENG - C8_NM_ENG | Class name in English (1-8) | VARCHAR2(3000) |  |
| ITM_ID | Item identifier | VARCHAR2(40) |  |
| ITM_NM | Item name | VARCHAR2(3000) |  |
| ITM_NM_ENG | Item name in English | VARCHAR2(3000) |  |
| UNIT_ID | Unit identifier | VARCHAR2(40) |  |
| UNIT_NM | Unit name | VARCHAR2(1000) |  |
| UNIT_NM_ENG | Unit name in English | VARCHAR2(1000) |  |
| PRD_SE | Data update cycle | VARCHAR2(20) |  |
| PRD_DE | Data period | VARCHAR2(8) |  |
| DT | Data value | VARCHAR2(100) |  |
| LST_CHN_DE | Date of change | VARCHAR2(8) |  |

It is worth noting that all fields are character, and numeric values are stored as strings. This means that you will need to convert them to numeric types after loading the data.


### API Key
To use the KOSIS API, you need to register for an API key. You can obtain one by visiting the [KOSIS API registration page](https://kosis.kr/openapi/serviceUse/serviceUseUnityReg_01Detail.do) and following the instructions.

The webpage is only available in Korean and may not be available outside of Korea.


### Example URL

The URL below is an example of how to retrieve general tax data from KOSIS. You will need to replace `인증키없음` with your actual API key. Dissecting this URL can help you understand how to construct your own API calls.

```{r tax-url}
url_tax_general <- 
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T001+&objL1=A0201+A0202+A0203+A0204+A0205+A0206+A0207+A0208+A0209+A0210+A0211+A0212+A0213+A0214+A0215+A0216+A0217+A0218+A0219+A0220+A0221+A0222+A0223+A0224+A0225+A0301+A0302+A0303+A0304+A0305+A0306+A0307+A0308+A0309+A0310+A0401+A0402+A0403+A0404+A0405+A0406+A0407+A0408+A0409+A0410+A0411+A0412+A0413+A0414+A0415+A0416+A0417+A0418+A0419+A0420+A0421+A0422+A0423+A0424+A0425+A0426+A0427+A0428+A0429+A0430+A0431+A0501+A0502+A0503+A0504+A0505+A0506+A0507+A0508+A0509+A0510+A0511+A0512+A0513+A0514+A0515+A0516+A0517+A0518+A0601+A0602+A0603+A0604+A0605+A0701+A0702+A0703+A0704+A0705+A0706+A0707+A0708+A0709+A0710+A0711+A0801+A0802+A0803+A0804+A0805+A0806+A0807+A0808+A0809+A0810+A0811+A0812+A0813+A0814+A0815+A09+A1001+A1002+A1003+A1004+A1005+A1101+A1102+A1103+A1104+A1105+A1106+A1107+A1108+A1109+A1110+A1111+A1112+A1113+A1114+A1201+A1202+A1203+A1204+A1205+A1206+A1207+A1208+A1209+A1210+A1211+A1212+A1213+A1214+A1215+A1216+A1217+A1218+A1219+A1220+A1221+A1222+A1309+A1301+A1302+A1303+A1304+A1305+A1306+A1307+A1308+A1401+A1402+A1403+A1404+A1405+A1406+A1407+A1408+A1409+A1410+A1411+A1412+A1413+A1414+A1415+A1416+A1417+A1418+A1419+A1420+A1421+A1422+A1423+A1501+A1502+A1503+A1504+A1505+A1506+A1507+A1508+A1509+A1510+A1511+A1512+A1513+A1514+A1515+A1516+A1601+A1602+A1603+A1604+A1605+A1701+A1702+A1703+A1704+A1705+A1706+A1707+A1708+A1709+A1710+A1711+A1712+A1713+A1714+A1715+A1716+A1717+A1718+A1802+A1801+&objL2=15133SGH0M+&objL3=&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=133&tblId=DT_133N_A3212"
```


| Parameter | Value |
|-----------|-------------------|
| `apiKey` | "인증키없음" (meaning "no authentication") as it is for demonstration. |
| `orgId` | "133" |
| `tblId` | "DT_133N_A3212" |
| `objL1` | "A0201", "A0202", ... , "A1801" |
| `objL2` | "15133SGH0M" |
| `itmId` | "T001" |
| `prdSe` | "Y" |
| `newEstPrdCnt` | "1" |
| `outputFields` | "TBL_ID", "TBL_NM", ... , "UNIT_NM_ENG" |


Adding more context, multiple values for `objL1` and `itmId` can be specified by separating them with a plus sign (`+`). For example, `itmId=T001+T002+T003` would retrieve data for multiple items.

Try interpreting the following URL to understand how to construct your own API calls:

```{r pop-url}
url_pop <-
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T00+T60+&objL1=11010+11020+11030+11040+11050+11060+11070+11080+11090+11100+11110+11120+11130+11140+11150+11160+11170+11180+11190+11200+11210+11220+11230+11240+11250+21010+21020+21030+21040+21050+21060+21070+21080+21090+21100+21110+21120+21130+21140+21150+21510+22010+22020+22030+22040+22050+22060+22070+22510+22520+23010+23020+23030+23040+23050+23060+23070+23080+23090+23510+23520+24010+24020+24030+24040+24050+25010+25020+25030+25040+25050+26010+26020+26030+26040+26510+29010+31010+31011+31012+31013+31014+31020+31021+31022+31023+31030+31040+31041+31042+31050+31051+31052+31053+31060+31070+31080+31090+31091+31092+31100+31101+31103+31104+31110+31120+31130+31140+31150+31160+31170+31180+31190+31191+31192+31193+31200+31210+31220+31230+31240+31250+31260+31270+31280+31550+31570+31580+32010+32020+32030+32040+32050+32060+32070+32510+32520+32530+32540+32550+32560+32570+32580+32590+32600+32610+33020+33030+33040+33041+33042+33043+33044+33520+33530+33540+33550+33560+33570+33580+33590+34010+34011+34012+34020+34030+34040+34050+34060+34070+34080+34510+34530+34540+34550+34560+34570+34580+35010+35011+35012+35020+35030+35040+35050+35060+35510+35520+35530+35540+35550+35560+35570+35580+36010+36020+36030+36040+36060+36510+36520+36530+36550+36560+36570+36580+36590+36600+36610+36620+36630+36640+36650+36660+36670+36680+37010+37011+37012+37020+37030+37040+37050+37060+37070+37080+37090+37100+37510+37520+37530+37540+37550+37560+37570+37580+37590+37600+37610+37620+37630+38030+38050+38060+38070+38080+38090+38100+38110+38111+38112+38113+38114+38115+38510+38520+38530+38540+38550+38560+38570+38580+38590+38600+39010+39020+&objL2=ALL&objL3=000+&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=101&tblId=DT_1IN1509"
```


## Data Retrieval

### Setting up API Access
```r
library(tidycensuskr)
library(dplyr)
library(tidyr)
library(kosis)

# Set KOSIS API key (stored in a secure file)
kosiskey <- readLines("~/.kosiskey")[1]
tidycensuskr::set_kosis_key("~/.kosiskey")
```

### Downloading Raw Data
```r
# Download raw datasets from KOSIS API
df_tax <- kosis::getStatDataFromURL(url_tax_general)

# Load administrative district lookup table
sgg_lookup <-
  read.csv(
    system.file(
      file.path("extdata", "lookup_district_code.csv"),
      package = "tidycensuskr"
    ),
    fileEncoding = "EUC-KR"
  )
```

## Data Cleaning Workflow

### 1. Administrative Code Mapping

The first challenge is creating a consistent mapping between different administrative code systems used across datasets:

```r
# Create administrative code mapping for provinces (sido)
sidocd_range <- tibble::tribble(
  ~sido_kr, ~sido_cd, ~sido_txcd,
  "서울특별시", "11", "02",
  "부산광역시", "21", "15", 
  "대구광역시", "22", "13",
  "인천광역시", "23", "03",
  "광주광역시", "24", "10",
  "대전광역시", "25", "06",
  "울산광역시", "26", "16",
  "세종특별자치시", "29", "09",
  "경기도", "31", "04",
  "강원특별자치도", "32", "05",
  "충청북도", "33", "07",
  "충청남도", "34", "08",
  "전라북도", "35", "11",
  "전라남도", "36", "12",
  "경상북도", "37", "14",
  "경상남도", "38", "17",
  "제주특별자치도", "39", "18"
)
```

### 2. Tax Data Processing

Transform raw tax data into a standardized format:

```r
df_tax_compact <- df_tax |>
  dplyr::transmute(
    adm2_code = C1,           # Administrative code
    value = DT   # Tax value in million KRW
  ) |>
  dplyr::inner_join(
    sgg_lookup[, c("sgg_tax_global", "sido_en", "sigungu_1_en", "adm2_code")],
    multiple = "first"
  )

```

### 3. Population Data Processing

Clean and reshape population data with gender disaggregation:

```r
df_pop2 <- df_pop |>
  dplyr::mutate(
    sex = plyr::mapvalues(C2, c(0, 1, 2), c("total", "male", "female")),
    type = plyr::mapvalues(ITM_ID, c("T00", "T60"), c("population_total", "population_nonrelative"))
  ) |>
  dplyr::select(C1, C1_NM, sex, type, DT) |>
  tidyr::pivot_wider(
    names_from = c(type, sex),
    values_from = DT
  ) |>
  dplyr::rename(
    sigungu_cd = C1,
    sigungu_kr = C1_NM
  ) |>
  dplyr::mutate(
    adm2_code = as.integer(adm2_code)
  ) |>
  dplyr::inner_join(
    sgg_lookup[, c("adm2_code", "sido_en", "sigungu_1_en")],
    by = "adm2_code",
    multiple = "first"
  )
```

## Transforming to Tidy Format

The key feature of `tidycensuskr` is converting wide-format, multiple-dataset structure into a single, long-format tidy dataset:

### 1. Convert Each Dataset to Long Format

```r
# Tax data to long format
df_tax_long <- df_tax_compact |>
  dplyr::select(2:5) |>
  tidyr::pivot_longer(
    cols = "value"
  ) |>
  dplyr::mutate(
    type = "tax",
    class1 = "global",
    class2 = "total",
    unit = "million KRW"
  ) |>
  dplyr::select(-name)

# Population data to long format
df_pop_long <- df_pop2 |>
  dplyr::select(-2) |>
  tidyr::pivot_longer(
    cols = 2:7
  ) |>
  tidyr::separate(col = "name", into = c("type", "class1", "class2"), sep = "_")

```

### 2. Combine into Single Tidy Dataset

```r
# Bind all datasets into one comprehensive long-format dataset
censuskor <- dplyr::bind_rows(
    df_tax_long,
    df_pop_long
  ) |>
  dplyr::rename(
    adm1 = sido_en,
    adm2 = sigungu_1_en
  ) |>
  dplyr::mutate(
    year = 2020
  ) |>
  dplyr::select(
    year, adm1, adm2, adm2_other, adm2_code,
    type, class1, class2, unit, value
  )
```

## The Result: A Tidy Dataset

The final `censuskor` dataset has a consistent structure:

- **year**: Census year (2020)
- **adm1**: Province-level administrative name (in English)
- **adm2**: District-level administrative name (in English)  
- **adm2_code**: Numeric administrative code
- **type**: Data type ("population", "tax", "mortality")
- **class1**: First-level classification (e.g., "global", "income", "All causes")
- **class2**: Second-level classification (e.g., "total", "male", "female")
- **unit**: Unit of measurement
- **value**: Numeric value

### Example Usage

```r
library(tidycensuskr)
library(dplyr)

# Load the cleaned dataset
data(censuskor)

# View the structure
head(censuskor)

# Filter for Seoul population data
seoul_pop <- censuskor |>
  filter(adm1 == "Seoul", type == "population", class1 == "population") |>
  select(adm2, class2, value) |>
  pivot_wider(names_from = class2, values_from = value)

head(seoul_pop)
```

These data can be joined with `sf` objects bundled in the package to make maps and perform spatial analysis

## Notes on `adm2_code` changes
Districts (_si-gun-gu_) are the second-level administrative units in South Korea. Each district is assigned a unique code called `adm2_code`. However, these codes can change over time due to administrative boundary changes, mergers, or reclassifications.

Notable changes were made in all rural districts (_gun_) in 2022; the third digit of `adm2_code` got addition of 2. For example, _Ulleung-gun_ changed from `37430` to `37630`. This change may affect longitudinal analyses that span multiple years and pose challenges in updating `censuskor` dataset by developers. Some datasets in KOSIS are **retrospectively** updated to reflect the new codes, while most KOSIS datasets retain the old codes. Therefore, developers should be advised to check the `adm2_code` for each dataset and year with this code:

```r
# imported data
some_census_table

# check adm2_code
any(substr(some_census_table$adm2_code, 3, 3) %in% c("5", "6"))
```

If `TRUE`, it indicates that the dataset uses the **new** `adm2_code` system (post-2022) regardless of the raw data year. If `FALSE`, it uses the **old** system (pre-2022).


**Note**: This vignette shows the data cleaning process for developers. The actual `censuskor` dataset in the package is the result of this workflow and is ready for immediate use.
