---
title: "Re-submission and debugging"
author: "George G. Vega Yon"
date: "`r Sys.Date()` (Last revision Feb 13, 2020)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Job-resubmission}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Like it or not, jobs often fail. When they do, it can be hard to
figure out what went wrong. The `slurmR` package has some tools that can help
you deal with this.

The documentation that follows applies to jobs submitted with `sbatch`, that is,
jobs that were submitted using either `Slurm_lapply`, `Slurm_sapply`, `Slurm_Map`,
or `Slurm_EvalQ`.

# Checking logs

When calling any of the `*apply` family functions, `slurmR` creates a folder
named `job_name` inside `tmp_path` that contains the following files:

```{r getting-names, echo=FALSE}
file_names <- list(
  r = slurmR::snames("r", tmp_path = "[tmp_path]", job_name = "[job-name]"),
  sh = slurmR::snames("sh", tmp_path = "[tmp_path]", job_name = "[job-name]"),
  out = slurmR::snames("out", tmp_path = "[tmp_path]", job_name = "[job-name]"),
  rds = slurmR::snames("rds", tmp_path = "[tmp_path]", job_name = "[job-name]")
)

file_names <- lapply(
  file_names, gsub, pattern = ".+/(?=[0-9])", replacement = "",
  perl = TRUE
  )

file_names <- lapply(file_names, function(f) paste0("`", f, "`"))
```

- `r file_names$r`: The R script that is used to load the data, and execute
  whatever the instruction is (`sapply`, `lapply`, `Map`, etc.).

- `r file_names$sh`: The Slurm configuration bash file. This passes all the SBATCH
  options the user specified and calls `Rscript` to run the R script.

- `r file_names$out`: The name pattern for the log files generated by `Rscript`.
  In the case of job arrays, `%A` stands for the job ID and `%a` for the
  array ID. This is usually the place to look for useful information on
  why the script failed.

- `r file_names$rds`: The name pattern of the output `rds` files. Usually, the
  jobs end up writing an output, e.g., the results from the `lapply` call, and
  the `%i` in the pattern indicates the array ID.
  
- `*.rds`: Further R objects that were exported for this particular job. In the
  case of `Slurm_lapply`, for example, these usually include the `X1.rds`, `X2.rds`,
  ..., `X[njobs].rds` files. Other R objects needed for the call are
  saved in this same folder as well.
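
To make the `%A` and `%a` placeholders concrete, the following base-R sketch
expands a log-file name pattern for a given job ID and a set of array IDs. Both
the pattern string and the helper `expand_log_name` are hypothetical and only
meant for illustration; the actual pattern is whatever `slurmR` generates for
your job.

```r
# Hypothetical log-name pattern; the real one is generated by slurmR.
log_pattern <- "output-%A-%a.out"

# Hypothetical helper: replaces %A with the job ID and %a with each array ID.
expand_log_name <- function(pattern, job_id, array_id) {
  # sprintf("%d", ...) avoids scientific notation for large IDs
  out <- gsub("%A", sprintf("%d", as.integer(job_id)), pattern, fixed = TRUE)
  vapply(
    array_id,
    function(a) gsub("%a", sprintf("%d", as.integer(a)), out, fixed = TRUE),
    character(1)
  )
}

# Log files that would correspond to array IDs 1 and 3 of job 6163924
log_files <- expand_log_name(log_pattern, job_id = 6163924, array_id = c(1, 3))
print(log_files)
```

Knowing which file belongs to which array ID makes it easier to go straight to
the log of a failed element.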
  
If there's an issue with the submitted job, the user can take a look at these
files. In general, looking at the log files is enough to figure out what could
be going on. Let's see the following example:


Suppose we submit a job that runs a complicated algorithm:

```r
library(slurmR)
x <- Slurm_lapply(
  1:1000, function(x) complicated_algorithm(x),
  njobs = 4,
  plan = "submit"
  )
```

By printing the output, you may see something like this:

```r
x
  Call:
 Slurm_lapply(X = 1:1000, FUN = function(x) complicated_algorithm(x), njobs = 4,
    plan = "submit")
job_name : slurmr-job-5724cb1616
tmp_path : /auto/rcf-40/vegayon/slurmR/slurmr-job-5724cb1616
job ID   : 6163924
Status: All jobs are pending resource allocation or are on it's way to start. (Code 1)
This is a job array. The status of each job, by array id, is the following:
 done      :  -
 failed    :  -
 pending   :  -
 running   :  1, 2, 3, 4.
```

But what happens if some of these fail, for example, jobs 1 and 3?

```r
x
  Call:
 Slurm_lapply(X = 1:1000, FUN = function(x) complicated_algorithm(x), njobs = 4,
    plan = "submit")
job_name : slurmr-job-5724cb1616
tmp_path : /auto/rcf-40/vegayon/slurmR/slurmr-job-5724cb1616
job ID   : 6163924
Status: One or more jobs failed. (Code 99)
This is a job array. The status of each job, by array id, is the following:
 done      :  2, 4.
 failed    :  1, 3.
 pending   :  -
 running   :  -
```

We can check the log files of the failed jobs using `Slurm_log`. For example,
to check out the log file of the first job of the array, we can
type:

```r
Slurm_log(x, which. = 1)
```

By default, while in interactive mode, you will get a prompt telling you that
`less` (the default) will be called using the `system2` command, and asking you
if you wish to continue. You can view the log file with a different command,
like `cat`, by passing it through the `cmd` argument, e.g.:

```r
Slurm_log(x, which. = 1, cmd = "cat")
```

Again, while in interactive mode, you will get a prompt asking you to enter `"y"`
or `"n"`. If the command fails, it is usually due to a missing log: either
you entered an invalid number in `which.`, or the job array never created the
log file. If the error has to do with the latter, then you can always inspect
the files located in the job folder using command line tools:

```bash
$ cd /path-to-the-temp-dir/path-to-the-job-name/
```
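
If you would rather stay in R, a minimal base-R sketch along these lines can
scan every log file in the job folder for lines that mention an error. The
helper `find_error_lines` is hypothetical (it is not part of `slurmR`), and it
assumes that log files end in `.out` and that error messages contain the word
`Error`; adjust both to your setup. The demo folder and its log files are made
up for illustration.

```r
# Hypothetical helper: returns, per log file, the lines matching `pattern`.
find_error_lines <- function(job_dir, log_ext = "\\.out$", pattern = "Error") {
  logs <- list.files(job_dir, pattern = log_ext, full.names = TRUE)
  hits <- lapply(logs, function(f) {
    lines <- readLines(f, warn = FALSE)
    grep(pattern, lines, value = TRUE, fixed = TRUE)
  })
  names(hits) <- basename(logs)
  # Keep only the logs with at least one matching line
  hits[lengths(hits) > 0]
}

# Demo on a throwaway folder with two made-up log files
demo_dir <- file.path(tempdir(), "demo-job")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(
  c("Loading data", "Error in FUN(X[[i]], ...) : boom"),
  file.path(demo_dir, "log-1.out")
)
writeLines(c("Loading data", "Done"), file.path(demo_dir, "log-2.out"))

errors <- find_error_lines(demo_dir)
print(errors)
```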

 
# Job-resubmission

Following the previous case, let's imagine that the failure was due to some
unexpected error (e.g., the node failed), so it makes sense to resubmit the job.
To do so, we can use the function `sbatch` as follows:

```r
# Recall that x is a slurm_job object
sbatch(x, array = "1,3")
```

This will re-submit the job, but only the components 1 and 3. Once it is done,
the user can collect the results using `Slurm_collect`. This will read in
the results of all jobs, not just 1 and 3.
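
When many array elements fail, typing the `array` argument by hand gets
tedious. As a sketch, the comma-separated value can be built from a vector of
failed array IDs with base R. Here the vector `failed_ids` is filled in by hand
from the printed status, and the commented-out call assumes that `x` is the
`slurm_job` object from the example above.

```r
# Array IDs reported as failed in the printed status (entered by hand here)
failed_ids <- c(1, 3)

# as.integer() avoids scientific notation when pasting large IDs
array_spec <- paste(as.integer(failed_ids), collapse = ",")
print(array_spec)

# Re-submit only the failed elements (requires a Slurm cluster):
# sbatch(x, array = array_spec)
```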

If for some reason the R session was closed before you were able to save the
`slurm_job` object, you can always recover it by using the `read_slurm_job`
function, e.g.:

```r
# Starting from a fresh session
library(slurmR)

# By typing the path to the job folder, slurmR will recover the job
x <- read_slurm_job("/path-to-the-temp-dir/path-to-the-job-name/")
```
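
A mistyped path is an easy mistake to make here. As a small defensive sketch,
the call can be wrapped so that a missing folder produces a clear message
before `read_slurm_job` is ever reached. The wrapper `recover_job` is
hypothetical and not part of `slurmR`.

```r
# Hypothetical wrapper: checks that the job folder exists before recovering it.
recover_job <- function(job_dir) {
  if (!dir.exists(job_dir)) {
    stop("Job folder not found: ", job_dir, call. = FALSE)
  }
  slurmR::read_slurm_job(job_dir)
}

# With a folder that does not exist, the error is caught and its message kept
res <- tryCatch(
  recover_job(file.path(tempdir(), "no-such-job")),
  error = function(e) conditionMessage(e)
)
print(res)
```

Once recovered, the object can be passed to `sbatch` or `Slurm_collect` as
described above.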

