---
title: "The ROCKproject file format"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The ROCKproject file format}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette describes version 1.0 of the ROCK project file format.

ROCK project files have the extension `.ROCKproject` and are ZIP archives. They contain two things:

- Files containing the data, ideally in a deliberately designed set of sub-directories to facilitate tracing the data through different stages of processing and analysis;

- Files containing settings and directives for applications and processing of the data.

The former are raw data files and ROCK files. ROCK files are plain text files with the `.rock` extension.

The latter are YAML files. Of these, the only required one is the `_ROCKproject.yml` file. This file must always be a regular YAML file that contains a map with key `_ROCKproject`. This map in turn must contain maps with keys `project`, `codebook`, `sources`, and `workflow`.
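As an illustration of these two requirements, the following Python sketch locates the `_ROCKproject.yml` file in an archive and reports which required keys are missing from an already parsed file. It is not part of the format: the helper names `find_project_file` and `missing_keys` are hypothetical, YAML parsing is assumed to be done elsewhere (the check works on plain dictionaries), and accepting the file in a sub-directory of the archive is an assumption, since its location is not specified here.

```python
import io
import zipfile

REQUIRED_KEYS = ("project", "codebook", "sources", "workflow")

def find_project_file(archive):
    """Return the archive member named _ROCKproject.yml, or None."""
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Assumption: the file may sit at the root or in a sub-directory.
            if name.rsplit("/", 1)[-1] == "_ROCKproject.yml":
                return name
    return None

def missing_keys(parsed):
    """List required keys absent from the parsed _ROCKproject map."""
    root = parsed.get("_ROCKproject")
    if not isinstance(root, dict):
        return ["_ROCKproject"]
    # A key with value None (YAML ~) still counts as present.
    return [key for key in REQUIRED_KEYS if key not in root]

# Build a tiny in-memory archive to exercise the helpers.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("_ROCKproject.yml", "_ROCKproject: {}\n")
    zf.writestr("data/010---raw-sources/interview.rock", "Hello\n")

print(find_project_file(buffer))
print(missing_keys({"_ROCKproject": {"project": {}, "sources": {}}}))
```

With the in-memory archive above, the first call returns `_ROCKproject.yml`, and the second reports `codebook` and `workflow` as missing.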

The `project` map contains project metadata, such as the project's `title`, its `authors`, optional (but strongly recommended!) author identifiers in `authorIds`, the project's `version`, the version of the ROCK standard used in the project (with key `ROCK_version`), the version of the ROCK project file (with key `ROCK_project_version`), the date the project was created (with key `date_created`), and the date the project was last modified (with key `date_modified`).
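A program reading a project may want to check these metadata fields before using them. The Python sketch below is one possible way to do so, using patterns adapted from the comments in the example file in this vignette. The function name `metadata_problems` is hypothetical, and treating `orcid` and `shorcid` as optional within each `authorIds` entry is an assumption.

```python
import re

# Patterns adapted from the comments in the example _ROCKproject.yml.
VERSION = re.compile(r"[0-9]+(\.[0-9]+)*")
ORCID = re.compile(r"([0-9]{4}-){3}[0-9]{4}")
SHORCID = re.compile(r"[0-9a-zA-Z]+")

def metadata_problems(project):
    """Return the names of `project` fields that fail their pattern."""
    problems = []
    for key in ("version", "ROCK_version", "ROCK_project_version"):
        # YAML may parse an unquoted 1 as an integer, so compare as text.
        if not VERSION.fullmatch(str(project.get(key, ""))):
            problems.append(key)
    for index, author in enumerate(project.get("authorIds") or []):
        for key, pattern in (("orcid", ORCID), ("shorcid", SHORCID)):
            value = author.get(key)
            # Assumption: identifiers are only checked when present.
            if value is not None and not pattern.fullmatch(str(value)):
                problems.append(f"authorIds[{index}].{key}")
    return problems

project = {
    "title": "The Alice Study",
    "version": "1.1",
    "ROCK_version": 1,
    "ROCK_project_version": "1.x",  # deliberately malformed
    "authorIds": [
        {"display_name": "Talea Cornelius",
         "orcid": "0000-0001-7181-0981", "shorcid": "ip6b381"},
        {"display_name": "Hypothetical Author",
         "orcid": "0000-0002"},  # deliberately malformed
    ],
}
print(metadata_problems(project))
```

Here the two deliberately malformed values are reported as `ROCK_project_version` and `authorIds[1].orcid`.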

The `codebook` map contains the project's codebook, either embedded or by linking to it. The `codebook` key can also have value `~` (NULL) if no codebook information is specified (or if the codebook is embedded in the ROCK files). Valid keys within the `codebook` map are `urcid`, `embedded`, and `local`. The `urcid` key can store the project's Unique ROCK Codebook Identifier (i.e. its URCID) as a URL to a ROCK codebook in spreadsheet (`.xlsx` or `.ods`) or YAML (`.yml` or `.rock`) format.

The `sources` map specifies where the project's data resides, mostly in terms of regular expressions. The first valid key is `extension`, which is not a regular expression but a convenient way to specify that files with a given extension must be imported. It is only used if `regex` is `~` (NULL, i.e. unspecified): if a value is specified for `regex`, a program importing a ROCK project should ignore whatever is specified for `extension`. The value stored in the `dirsToIncludeRegex` key should be a regular expression indicating which directories contain the data (i.e. the ROCK files forming the project). The `recursive` key can be `true` or `false` and indicates whether all subdirectories of matched directories should be imported too. The `dirsToExcludeRegex` regular expression can be used to ignore directories. In addition, if `filesToIncludeRegex` is specified, only files matching that regular expression should be imported; and if `filesToExcludeRegex` is specified, files matching that regular expression should be ignored.
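The selection rules above can be sketched in Python as a filter over relative file paths (for example, the member names of the ZIP archive). This is only one possible reading of the rules: the function name `select_sources` is hypothetical, regular expressions are assumed to be applied as a search against the directory path (with a trailing slash) or the file name, and the `recursive` key is left out because its exact interaction with `dirsToIncludeRegex` is not spelled out here.

```python
import posixpath
import re

def select_sources(paths, sources):
    """Return the paths that a `sources` map would import (sketch)."""
    def found(pattern, text):
        # An unspecified (None) pattern never matches.
        return pattern is not None and re.search(pattern, text) is not None

    selected = []
    for path in paths:
        directory = posixpath.dirname(path) + "/"
        name = posixpath.basename(path)

        # `regex` takes precedence; `extension` is only a fallback.
        if sources.get("regex") is not None:
            wanted = found(sources["regex"], name)
        else:
            wanted = name.endswith(sources.get("extension") or "")

        include_dirs = sources.get("dirsToIncludeRegex")
        if include_dirs is not None and not found(include_dirs, directory):
            wanted = False
        if found(sources.get("dirsToExcludeRegex"), directory):
            wanted = False

        include_files = sources.get("filesToIncludeRegex")
        if include_files is not None and not found(include_files, name):
            wanted = False
        if found(sources.get("filesToExcludeRegex"), name):
            wanted = False

        if wanted:
            selected.append(path)
    return selected

sources = {
    "extension": ".rock", "regex": None,
    "dirsToIncludeRegex": "data/", "dirsToExcludeRegex": None,
    "filesToIncludeRegex": None,
    "filesToExcludeRegex": "^draft",  # hypothetical value for illustration
}
paths = [
    "data/010---raw-sources/interview-1.rock",
    "data/010---raw-sources/draft-notes.rock",
    "data/010---raw-sources/audio.mp3",
    "archive/old.rock",
]
print(select_sources(paths, sources))
print(select_sources(paths, dict(sources, regex=r"\.mp3$")))
```

The first call keeps only `interview-1.rock`. In the second call, `regex` is set, so `extension` is ignored and only `audio.mp3` is selected.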

Finally, the `workflow` map describes the workflow and data management template used in this project. It consists of a `pipeline` and `actions`.

The `pipeline` is a sequence of stages, each with an identifier (in key `stage`); the directory containing the files in that stage (in key `dirName`; note that this is a single directory name, not a regular expression!); and a sequence of one or more next stages (with key `nextStages`). Each element in `nextStages` has a `nextStageId` key and an `actionId` key. The `nextStageId` specifies to which stage files transfer (i.e. where they are saved) when the action with the corresponding `actionId` is executed.

These `actions` are stored in a sequence where each element has an `actionId`; a `language` specifying the programming language the action is written in; one or more `dependencies` (typically packages that need to be loaded in that programming environment before the `script` can be executed); and a `script` section specifying the commands to run to execute that action. In this script, two placeholders can be used: `{currentStage::dirName}` will be replaced with the contents of `dirName` for the current stage, and `{nextStage::dirName}` will be replaced with the contents of `dirName` for the next stage. The latter part of these expressions (`dirName` in both of these examples) can be replaced by other keys specified in each stage, which allows setting parameters in the pipeline specification.
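The placeholder substitution can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `fill_placeholders` is hypothetical, stages are assumed to be available as dictionaries, and values are inserted verbatim without any quoting.

```python
import re

# Matches {currentStage::key} and {nextStage::key}.
PLACEHOLDER = re.compile(r"\{(currentStage|nextStage)::(\w+)\}")

def fill_placeholders(script, current_stage, next_stage):
    """Replace stage placeholders in an action script with stage values."""
    stages = {"currentStage": current_stage, "nextStage": next_stage}

    def substitute(match):
        which, key = match.groups()
        # A KeyError here means the stage lacks the requested key.
        return str(stages[which][key])

    return PLACEHOLDER.sub(substitute, script)

raw = {"stage": "raw", "dirName": "data/010---raw-sources"}
uids = {"stage": "uids", "dirName": "data/030---sources-with-uids"}
script = (
    "rock::prepend_ids_to_sources(\n"
    "  input = {currentStage::dirName},\n"
    "  output = {nextStage::dirName}\n"
    ");"
)
filled = fill_placeholders(script, raw, uids)
print(filled)
```

Because any key of a stage can follow the `::`, the same function also resolves, for instance, `{currentStage::stage}` to `raw`.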

An example of a `_ROCKproject.yml` file is included below.

```yaml

_ROCKproject:

  project:

    title: "The Alice Study"                     # Any character string
    authors: "Author names as string"            # Any character string
    authorIds:
      -
        display_name: "Talea Cornelius"          # Any character string
        orcid: "0000-0001-7181-0981"             # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "ip6b381"                       # Any character string matching ^[0-9a-zA-Z]+$
      -
        display_name: "Gjalt-Jorn Peters"        # Any character string
        orcid: "0000-0002-0336-9589"             # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "it36ll9"                       # Any character string matching ^[0-9a-zA-Z]+$

    version: "1.1"                               # Anything matching regex [0-9]+(\.[0-9]+)*
    ROCK_version: 1                              # Anything matching regex [0-9]+(\.[0-9]+)*
    ROCK_project_version: 1                      # Anything matching regex [0-9]+(\.[0-9]+)*
    date_created: "2023-03-01 20:03:51 UTC"      # Anything matching that date format, preferably converted to UTC timezone
    date_modified: "2023-03-08 20:03:51 UTC"     # Anything matching that date format, preferably converted to UTC timezone

  codebook:
    urcid: ""
    embedded: ~
    local: ""

  sources:

    extension: ".rock"                           # Any valid extension
    regex: ~                                     # Any regex or ~
    dirsToIncludeRegex: data/                    # Any regex or ~
    recursive: true                              # true or false
    dirsToExcludeRegex: ~                        # Any regex or ~
    filesToIncludeRegex: ~                       # Any regex or ~
    filesToExcludeRegex: ~                       # Any regex or ~

  workflow:

    pipeline:
      -
        stage: raw                               # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/010---raw-sources"        # Any valid directory name, using a forward slash as separator
        nextStages:
          -
            nextStageId: clean                   # A different stage identifier or ~
            actionId: cleanSource
          -
            nextStageId: uids                    # A different stage identifier or ~
            actionId: addUIDs
      -
        stage: clean                             # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/020---cleaned-sources"    # Any valid directory name, using a forward slash as separator
        nextStages:
          -
            nextStageId: uids                    # A different stage identifier or ~
            actionId: addUIDs
      -
        stage: uids                              # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/030---sources-with-uids"  # Any valid directory name, using a forward slash as separator
        nextStage: coded                         # A different stage identifier or ~
      -
        stage: coded                             # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/040---coded-sources"      # Any valid directory name, using a forward slash as separator
        nextStage: masked                        # A different stage identifier or ~
      -
        stage: masked                            # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/090---masked-sources"     # Any valid directory name, using a forward slash as separator
        nextStage: ~                             # A different stage identifier or ~

    actions:
      -
        actionId: addUIDs                        # String, referenced from the stages
        language: R                              # Language, has to be matched to interpreter
        dependencies: rock                       # Dependencies to be loaded before running the script
        script: |                                # Literal block style string
          rock::prepend_ids_to_sources(
            input = {currentStage::dirName},
            output = {nextStage::dirName}
          );

```
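
Because stages refer to each other and to actions by identifier, a program importing a project may want to check that these references resolve before running anything. The Python sketch below shows one way to do this on a parsed `workflow` map. The function name `unresolved_references` is hypothetical, a `nextStageId` of `~` (NULL) is assumed to mean that there is no next stage, and the sample data is a trimmed-down workflow in which the `clean` stage and the `cleanSource` action are deliberately left undefined.

```python
def unresolved_references(workflow):
    """Return (stage, key, value) triples for references that do not resolve."""
    pipeline = workflow.get("pipeline") or []
    stage_ids = {stage["stage"] for stage in pipeline}
    action_ids = {a["actionId"] for a in workflow.get("actions") or []}

    problems = []
    for stage in pipeline:
        for step in stage.get("nextStages") or []:
            next_id = step.get("nextStageId")
            # Assumption: None (YAML ~) means "no next stage" and is skipped.
            if next_id is not None and next_id not in stage_ids:
                problems.append((stage["stage"], "nextStageId", next_id))
            action_id = step.get("actionId")
            if action_id not in action_ids:
                problems.append((stage["stage"], "actionId", action_id))
    return problems

workflow = {
    "pipeline": [
        {"stage": "raw", "dirName": "data/010---raw-sources", "nextStages": [
            {"nextStageId": "clean", "actionId": "cleanSource"},
            {"nextStageId": "uids", "actionId": "addUIDs"},
        ]},
        {"stage": "uids", "dirName": "data/030---sources-with-uids"},
    ],
    "actions": [{"actionId": "addUIDs", "language": "R"}],
}
print(unresolved_references(workflow))
```

For this sample, the check reports the undefined `clean` stage and the undefined `cleanSource` action, both referenced from the `raw` stage.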