---
title: "Updating Datasets on DataONE"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Updating Datasets on DataONE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\usepackage[utf8]{inputenc}
---

## Updating A DataONE Package

A DataONE package is a collection of datasets and other files that are described by a metadata file.

A *DataPackage* is an R object, defined by the `datapack` package, that holds a DataONE package that is either downloaded from
DataONE or newly created locally in R and then uploaded to DataONE.

After a package has been uploaded to a DataONE Member Node, the package submitter or other
interested parties may determine that it needs to be updated, for example to add a missing file, replace one file
with another, or remove a member from the package.

These types of modifications can be accomplished by downloading the package from DataONE using the 
`getDataPackage()` method to create a local copy of the package in R, modifying the package contents locally,
then uploading the modified package to DataONE. 

The complete example script used to perform a package update is shown below:

```{r}
library(datapack)
library(dataone)
```
```{r, eval=FALSE}
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- "resource_map_urn:uuid:a9aeefcf-228c-4534-b4ad-b480a937be7d"
```
```{r, eval=FALSE}
pkg <- getDataPackage(d1c, identifier=packageId, lazyLoad=TRUE, quiet=FALSE)
```
```{r, eval=FALSE,message=FALSE}
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv')
zipfile <- system.file("extdata/Strix-occidentalis-obs.csv.zip", package="dataone")
pkg <- replaceMember(pkg, objId, replacement=zipfile, formatId="application/octet-stream")
auxFile <- system.file("extdata/WeatherInf.txt", package="dataone")
auxObj <- new("DataObject", format="text/plain", filename=auxFile) 
pkg <- addMember(pkg, auxObj, metadataId)

```
<!-- Need authentication, so can't upload from a vignette. -->
```{r,eval=FALSE}
newPackageId <- uploadDataPackage(d1c, pkg, public=TRUE, quiet=FALSE)
```

This example script downloads the example package that was created and uploaded to DataONE
in the vignette `upload-data`.

Each line of this script will be described in the following sections.

### 1. Download the package from DataONE

The first step in updating a package is to download the package from DataONE into R, so that it
can be modified locally using methods in the `dataone` package. Modifications can be made to
the package such as adding or removing members from the package, or changing the contents of
a package member, as would be the case if the wrong file was initially uploaded.

The `getDataPackage()` method downloads all files belonging to the package specified by
the `identifier` parameter. 

```{r, eval=FALSE}
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
```
```{r, eval=FALSE}
pkg <- getDataPackage(d1c, identifier=packageId, lazyLoad=TRUE, quiet=FALSE)
```
Because packages might contain large files or a large number of files, it is
possible to *lazy load* the package. This means that the metadata describing the files is
downloaded, but the file contents (the data bytes) are not. For more information on using the
`lazyLoad` parameter, see the R console help for the `getDataPackage()` function.

For the package update that will be shown, it is not necessary to download the data bytes of each
package member, so lazy loading is acceptable. Lazy loading would not be appropriate when, for example, the files in a package will be used for local computational processing.

Note that metadata files, such as the EML in this example, are always downloaded regardless of the `lazyLoad` parameter value.

### 2. Review package contents

The downloaded package can be viewed by typing the DataPackage object name at the console, which invokes the
R `show` method for the object:
```{r, echo=FALSE}
saveWidth <- getOption("width")
options(width=100)
```
```{r, eval=FALSE}
pkg
```
```{r, echo=FALSE}
options(width=saveWidth)
```
Note that the `show` output for a DataPackage is condensed to fit the width of the current R console. If
the output is condensed and more detail is required, set the R console width to a larger value, for example,
`options(width=120)`, or if using RStudio, widen the console window by clicking and dragging on
the window border.

### 3. Modify DataObjects in the package
The original uploaded package included the file `Strix-occidentalis-obs.csv`. This file can be replaced
with a different file, as would be necessary if it were determined that the zipped form of the
file should have been used instead.

First, determine which DataObject in the DataPackage `pkg` contains the file to be replaced:
```{r, eval=FALSE}
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv')
```

The `selectMember` method inspects every DataObject in the package `pkg` and checks 
for a match in the R S4 slot specified by the `name` argument for the value specified with 
the `value` argument. The identifier for any matching DataObject is returned. 

The documentation for the available DataObject R slots can be viewed with the command
`help("DataObject-class")`. As a *SystemMetadata* object is contained in each DataObject, the slots
for SystemMetadata are also available, with documentation viewable with `help("SystemMetadata-class")`.
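
The matching behaviour can be tried without a DataONE connection. The following is a minimal sketch that assumes only the `datapack` package is installed. The identifiers `demo-metadata` and `demo-data`, and the temporary files they point to, are made up for illustration and are not part of the example package.

```{r, eval=FALSE}
library(datapack)

# Temporary files stand in for a metadata file and a data file (hypothetical contents).
emlFile <- tempfile(fileext=".xml")
csvFile <- tempfile(fileext=".csv")
writeLines("<eml/>", emlFile)
writeLines(c("site,count", "A,3"), csvFile)

# Build a small two-member package locally.
demoPkg <- new("DataPackage")
emlObj <- new("DataObject", id="demo-metadata",
              format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
csvObj <- new("DataObject", id="demo-data", format="text/csv", filename=csvFile)
demoPkg <- addMember(demoPkg, emlObj)
demoPkg <- addMember(demoPkg, csvObj, "demo-metadata")

# Select the member whose system metadata formatId slot matches the value.
csvId <- selectMember(demoPkg, name="sysmeta@formatId", value="text/csv")
csvId
```

Only the CSV member has a `formatId` of `text/csv`, so a single identifier is expected back.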

Next, update the DataObject in `pkg` to replace the file `Strix-occidentalis-obs.csv` with `Strix-occidentalis-obs.csv.zip` 
using the `replaceMember` method:
```{r,eval=FALSE}
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv')
zipfile <- system.file("extdata/Strix-occidentalis-obs.csv.zip", package="dataone")
pkg <- replaceMember(pkg, objId, replacement=zipfile, formatId="application/octet-stream")
```

The `replaceMember` method updates the DataPackage `pkg`, replacing the data content of the DataObject 
with identifier `objId`, and updating the relevant system metadata slots such as `size` and `checksum`, 
to reflect the new contents of the DataObject. 
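
The effect on the system metadata can be checked locally. This is a sketch under stated assumptions: the two temporary files and the identifier `demo-data` are hypothetical, and the member is looked up again after the replacement rather than assuming its identifier is unchanged.

```{r, eval=FALSE}
library(datapack)

# Two temporary files stand in for the original and the replacement (hypothetical contents).
oldFile <- tempfile(fileext=".csv")
newFile <- tempfile(fileext=".csv")
writeLines(c("site,count", "A,3"), oldFile)
writeLines(c("site,count", "A,3", "B,5", "C,8"), newFile)

demoPkg <- new("DataPackage")
demoObj <- new("DataObject", id="demo-data", format="text/csv", filename=oldFile)
demoPkg <- addMember(demoPkg, demoObj)
sizeBefore <- getMember(demoPkg, "demo-data")@sysmeta@size

# Swap in the new file, keeping the same format.
demoPkg <- replaceMember(demoPkg, "demo-data", replacement=newFile, formatId="text/csv")

# Look the member up again by format, then read the recorded size.
demoId <- selectMember(demoPkg, name="sysmeta@formatId", value="text/csv")
sizeAfter <- getMember(demoPkg, demoId)@sysmeta@size
c(before=sizeBefore, after=sizeAfter)
```

The recorded size should now correspond to the larger replacement file.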

In this example, the file `WeatherInf.txt` was mistakenly omitted from the original package upload,
so it will be added now.

First, retrieve the identifier for the metadata DataObject that describes the package members
from the DataPackage:

```{r, eval=FALSE}
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
```

The `selectMember` method returns the identifier of any DataObject in the package whose R slot, named
in the argument `name`, matches the value specified in the argument `value`. In this
particular package, there is only one DataObject that has a `formatId` of `eml://ecoinformatics.org/eml-2.1.1`, so only that identifier is returned.

Note that the `getValue` method can be used to retrieve the values for DataObject slots, for example:

```{r,eval=FALSE}
getValue(pkg, name="sysmeta@formatId")
```

An R named list is returned, with the names being the identifiers of each DataObject in the DataPackage
and the values being the slot values.

Next, add a new package member that was omitted from the original package:

```{r,eval=FALSE}
auxFile <- system.file("extdata/WeatherInf.txt", package="dataone")
auxObj <- new("DataObject", format="text/plain", filename=auxFile)
pkg <- addMember(pkg, auxObj, metadataId)
```
The modified package can be reviewed before uploading it to DataONE by typing `pkg` at the console, as in step 2.

```{r, echo=FALSE}
saveWidth <- getOption("width")
options(width=100)
```
```{r, echo=FALSE}
options(width=saveWidth)
```

### 4. Upload the modified DataPackage
Now upload the modified package to DataONE. The `uploadDataPackage` method inspects each DataObject in the
DataPackage: DataObjects that have been modified are updated, and DataObjects that have been
added to the DataPackage are uploaded.
```{r, eval=FALSE}
newPackageId <- uploadDataPackage(d1c, pkg, public=TRUE, quiet=FALSE)
```

## Updating Metadata For A DataPackage

As DataPackage members are modified, the DataObject that holds the metadata that describes the 
package members may become outdated. 

For example, it was shown above that a package member was updated to contain the
file `Strix-occidentalis-obs.csv.zip` instead of `Strix-occidentalis-obs.csv`. After this change, the DataObject
holding the EML metadata should be updated so that the EML element `objectName` for this
dataset matches the new filename:

```{r, eval=FALSE}
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
nameXpath <- '//dataTable/physical/objectName[text()="Strix-occidentalis-obs.csv"]'
newName <- basename(zipfile)
pkg <- updateMetadata(pkg, metadataId, xpath=nameXpath, replacement=newName)
```

The `updateMetadata` method updates a DataObject containing XML, replacing the value of the element
located by the `xpath` argument with the value provided in the `replacement` argument.
In this example, the XML contained in the DataObject with identifier `metadataId` is an EML
metadata document. The EML element `objectName` is updated with the value `Strix-occidentalis-obs.csv.zip`.


Note that the `xpath` argument uses the XPath query language, which is used to locate elements within an
XML document.
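
To see what such an expression selects, it can be tested against a small XML fragment before it is passed to `updateMetadata`. The sketch below uses the `XML` package, which is an assumption about tooling and says nothing about how `updateMetadata` is implemented. The fragment is a cut-down stand-in shaped like the EML in this vignette, not a complete EML document, and `no-such-file.csv` is a made-up name.

```{r, eval=FALSE}
library(XML)

# A cut-down fragment shaped like the EML used in this vignette.
emlFragment <- '<dataset>
  <dataTable>
    <physical>
      <objectName>Strix-occidentalis-obs.csv</objectName>
    </physical>
  </dataTable>
</dataset>'
doc <- xmlParse(emlFragment, asText=TRUE)

# The predicate [text()="..."] keeps only objectName elements with exactly that text.
nameXpath <- '//dataTable/physical/objectName[text()="Strix-occidentalis-obs.csv"]'
matched <- xpathSApply(doc, nameXpath, xmlValue)
matched

# An expression naming a file that is not in the document matches no nodes.
missed <- getNodeSet(doc, '//dataTable/physical/objectName[text()="no-such-file.csv"]')
length(missed)
```

Checking an expression this way helps catch a mistyped file name, which would otherwise leave nothing to replace.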

For EML metadata, another element that may need to be updated, if it is present in the original
EML file, is the distribution url. This element contains the link that can be used to download the file 
from DataONE, which includes the DataONE identifier for the object. Because the identifier value is set 
when the DataObject is created, the EML metadata is out of date and needs to be updated with the new 
identifier value.

In this example, the section of the metadata that describes the dataset `Strix-occidentalis-obs.csv.zip` will be updated. 

A portion of the EML from our example looks like this:
```
    <dataTable>
      <entityName>Strix Occidentalis</entityName>
      <entityDescription>A data file that contains only observations of Strix occidentalis</entityDescription>
      <physical>
        <objectName>Strix-occidentalis-obs.csv</objectName>
        <size unit="byte">6017</size>
        <dataFormat>
          <externallyDefinedFormat>
            <formatName>text/csv</formatName>
          </externallyDefinedFormat>
        </dataFormat>
        <distribution id="1430343425153">
          <online>
            <url function="download"></url>
          </online>
        </distribution>
      </physical>
      <entityType>Other</entityType>
    </dataTable>
```

The following lines get the current identifier for the DataObject that contains the
correct file, and use this identifier value to construct a standard DataONE access
URL, using the base URL of the DataONE coordinating node.

```{r, eval=FALSE}
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv.zip')
newURL <- sprintf("%s/%s/resolve/%s", d1c@cn@baseURL, d1c@cn@APIversion, objId)
newURL
```

The `selectMember` method checks the slot specified in the `name` argument, in this case `sysmeta@fileName`,
of each *DataObject* in the *DataPackage* `pkg` and returns the identifier of any member that
matches the specified value. This package has only one *DataObject* with the name
`Strix-occidentalis-obs.csv.zip`, so only one identifier is returned in `objId`.
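
The shape of the URL produced by the `sprintf` call above can be illustrated without a connection to DataONE. In this sketch the base URL, API version and identifier are hypothetical stand-ins for `d1c@cn@baseURL`, `d1c@cn@APIversion` and `objId`; the real values depend on the DataONE environment in use.

```{r, eval=FALSE}
# Hypothetical stand-ins for d1c@cn@baseURL, d1c@cn@APIversion and objId.
baseURL <- "https://cn.example.org/cn"
apiVersion <- "v2"
demoId <- "urn:uuid:00000000-0000-0000-0000-000000000000"

# Same pattern as above: <base URL>/<API version>/resolve/<identifier>
demoURL <- sprintf("%s/%s/resolve/%s", baseURL, apiVersion, demoId)
demoURL
```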

Next, the metadata element for the distribution url will be updated in the *DataObject* that contains
the EML metadata in the package `pkg`:

```{r, eval=FALSE}
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
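# Assumes objectName was already changed to the zip file name by the earlier updateMetadata call.
xpathToURL <- "//dataTable/physical/distribution[../objectName/text()=\"Strix-occidentalis-obs.csv.zip\"]/online/url"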
pkg <- updateMetadata(pkg, do=metadataId, xpath=xpathToURL, replacement=newURL)
```

When the modified metadata file is uploaded to DataONE, a new identifier is required,
as the modified metadata will replace the old one by creating a new object in DataONE
and marking the old one as `obsoleted` by the new one. For this reason, the call
to `updateMetadata` will assign a new identifier to the metadata *DataObject* in the
DataPackage `pkg`, if necessary.
