---
title: "Parsing Functions in edgarWebR"
output: rmarkdown::html_vignette
date: 2017-10-12
author: "Micah J Waldstein <micah@waldste.in>"
vignette: >
  %\VignetteIndexEntry{Parsing Functions in edgarWebR}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
projects: ['edgarWebR']
categories: ['RStats']
type: post
draft: false
---
```{r setup, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(edgarWebR)
set.seed(0451)
# Cache http requests
library(httptest)
start_vignette("parsing")
```
New to edgarWebR 0.2.0 are functions for parsing SEC documents. While there are
good R packages for XBRL processing, there is a gap in extracting information
from the other document types available via the SEC site. edgarWebR currently
provides functions for two of those:

 * `parse_submission()` - Processes a raw SGML filing into component
   documents. These are the 'Complete submission text file' on filing pages.
   Similar to zip files, they contain all the files included in a particular
   submission.
 * `parse_filing()` - Processes a narrative filing (e.g. 10-K, 10-Q) into
   paragraphs annotated with part and item numbers. In a submission with many
   files, this is the main form.

This vignette will show how to use both functions to find the risks reported in
a company's recent filing.

## Find a Submission
Using edgarWebR functions, we'll first look up a recent filing.
```{r companyInfo}
ticker <- "STX"

filings <- company_filings(ticker, type = "10-Q", count = 40)
# Specifying the type returns all forms that start with that string (e.g.
# 10-Q/A), so we need to manually filter.
filings <- filings[filings$type == "10-Q", ]
# We're only interested in a particular filing
filing <- filings[filings$filing_date == "2017-10-27", ]
filing$md_href <- paste0("[Link](", filing$href, ")")
knitr::kable(filing[, c("type", "filing_date", "accession_number", "size",
                                "md_href")],
             col.names = c("Type", "Filing Date", "Accession No.", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark = ","))
```
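
The manual filter in the chunk above can be illustrated without a network
request. The following is a minimal sketch on an invented data frame
(`toy_filings` and its rows are assumptions for illustration, not real
`company_filings()` output) showing how an exact comparison drops related form
types that a prefix match would keep.

```r
# Toy stand-in for the data frame returned by company_filings(); the rows
# are invented for illustration only.
toy_filings <- data.frame(
  type = c("10-Q", "10-Q/A", "10-Q"),
  filing_date = c("2017-10-27", "2017-05-01", "2017-04-28"),
  stringsAsFactors = FALSE
)

# A prefix match keeps the amendment (10-Q/A) as well as the 10-Qs
prefix_match <- toy_filings[startsWith(toy_filings$type, "10-Q"), ]

# An exact match keeps only the rows whose type is exactly "10-Q"
exact_match <- toy_filings[toy_filings$type == "10-Q", ]

nrow(prefix_match)
nrow(exact_match)
```

If amendments are of interest, the exact comparison could be swapped for
`%in% c("10-Q", "10-Q/A")`.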

## Get the Complete Submission File
We'll next get the list of files and find the link to the complete submission.
```{r document}
docs <- filing_documents(filing$href)
doc <- docs[docs$description == 'Complete submission text file', ]
doc$md_href <- paste0("[Link](", doc$href, ")")

knitr::kable(doc[, c("seq", "description", "document", "size",
                     "md_href")],
             col.names = c("Sequence", "Description", "Document",
                           "Size", "Link"),
             digits = 2,
             format.args = list(big.mark = ","))
```

Normally, we would use `filing_documents()` to get to the 10-Q directly, but
here we use the complete submission file to demonstrate the
`parse_submission()` function. The complete submission file is the better
choice when you need the full list of files (in this case there are 80 files in
the submission, but only 10 are listed on the website and therefore available
to `filing_documents()`) or when you are downloading all of the documents
anyway and a single download is more efficient than many.

## Parse the Complete Submission File
Now that we have the link to the complete submission file, we can parse it into
components.
```{r parse_submission}
parsed_docs <- parse_submission(doc$href)
knitr::kable(head(parsed_docs[, c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names = c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark = ","))
```

As a further example, here's the end of the full list. Note, for instance, the
Excel file, which isn't listed on the SEC site.
```{r parse_submission_tail}
knitr::kable(tail(parsed_docs[, c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names = c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark = ","))
```

The 10-Q Filing document is Seq. 1, with the full text of the document in the
TEXT column.
```{r show_text}
# NOTE: the filing document is not always #1, so it is a good idea to also look
# at the type & Description
filing_doc <- parsed_docs[parsed_docs$TYPE == '10-Q' &
                          parsed_docs$DESCRIPTION == '10-Q', 'TEXT']
substr(filing_doc, 1, 80)
```
We can see that it contains the raw document. For document types which are not
plain text, e.g. the XBRL zip file, the content is uuencoded and would need
further processing.
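
Before treating a `TEXT` value as readable markup, it can help to check whether
it is uuencoded. The sketch below is one possible check, assuming the standard
uuencode layout in which the body opens with a `begin <mode> <filename>` header
line. `looks_uuencoded()` is a hypothetical helper, not part of edgarWebR, and
the sample strings are invented.

```r
# Hypothetical helper: TRUE when one of the first few lines of a document
# body is a uuencode header of the form "begin <mode> <filename>".
looks_uuencoded <- function(text, max_lines = 5) {
  lines <- strsplit(text, "\r?\n")[[1]]
  lines <- utils::head(lines, max_lines)
  any(grepl("^begin [0-7]{3,4} .+", lines))
}

# Invented sample bodies, for illustration only
plain_body <- "<html>\n<body>Quarterly report text</body>\n</html>"
encoded_body <- "begin 644 report.xlsx\nM4$L#!!0\nend"

looks_uuencoded(plain_body)
looks_uuencoded(encoded_body)
```

Rows flagged this way would need to be decoded with a separate tool before
their content can be used.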

## Parse the Filing Document
Fortunately, edgarWebR functions that take URLs will also take a string
containing the document. While we could have passed the URL of the online
document to parse it, we can just pass in the full string we already have.
```{r parseFiling}
doc <- parse_filing(filing_doc, include.raw = TRUE)
unique(doc$part.name)
unique(doc$item.name)
head(doc[grepl("market risk", doc$item.name, ignore.case = TRUE), "text"], 3)
risks <- doc[grepl("market risk", doc$item.name, ignore.case = TRUE), "raw"]
```
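
The filtering step above can be tried on a small invented data frame. This is a
sketch only: `toy_doc` mimics the `part.name`, `item.name` and `text` columns
used above, but its rows are made up and it is not real `parse_filing()`
output.

```r
# Toy stand-in for parse_filing() output; every row is invented.
market_item <- "ITEM 3. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK"
toy_doc <- data.frame(
  part.name = c("PART I", "PART I", "PART I", "PART II"),
  item.name = c("ITEM 2. MANAGEMENT'S DISCUSSION", market_item, market_item,
                "ITEM 1A. RISK FACTORS"),
  text = c("Revenue increased.", "Interest rate risk.",
           "Foreign currency risk.", "We face competition."),
  stringsAsFactors = FALSE
)

# Case-insensitive match on the item heading, as in the chunk above
is_market_risk <- grepl("market risk", toy_doc$item.name, ignore.case = TRUE)
market_risk_text <- toy_doc[is_market_risk, "text"]

# Paragraph counts per item give a quick overview of the document's shape
paragraphs_per_item <- table(toy_doc$item.name)

market_risk_text
paragraphs_per_item
```

Matching on a phrase rather than on the item number keeps the filter
independent of how a particular filing numbers its items.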

Now the document is all ready for whatever further processing we want. As a
quick example we'll pull out all the italicized risks.
```{r parseRisks}
risks <- risks[grep("<i>", risks)]
risks <- gsub("^.*<i>|</i>.*$", "", risks)
risks <- gsub("\n", " ", risks)
risks
```
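
To see what each of the three steps does, here is the same sequence applied to
a few invented HTML fragments (`toy_raw` is an assumption for illustration, not
text from the filing).

```r
# Invented raw paragraphs; the second has no italic text and the third has a
# line break inside the italic span.
toy_raw <- c(
  "<p><i>Interest Rate Risk.</i> Our exposure to market risk...</p>",
  "<p>No emphasis in this paragraph.</p>",
  "<p><i>Foreign Currency\nExchange Risk.</i> We may enter into...</p>"
)

# 1. Keep only the paragraphs that contain an opening italic tag
with_italics <- toy_raw[grep("<i>", toy_raw, fixed = TRUE)]

# 2. Strip everything up to and including <i>, and everything from </i> on
headings <- gsub("^.*<i>|</i>.*$", "", with_italics)

# 3. Replace embedded line breaks with spaces
headings <- gsub("\n", " ", headings)

headings
```

Because `^.*<i>` is greedy, a paragraph with several italic spans would keep
only the last one, so this approach suits filings where each risk paragraph
opens with a single italicized heading.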

This is a fairly simplistic example, but should serve as a good tutorial on
processing filings.

## How to Download
edgarWebR is available from CRAN, so it can be installed with:
```{r eval=FALSE}
install.packages("edgarWebR")
```

If you want the latest and greatest, you can get a copy of the development version
from GitHub by using devtools:
```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("mwaldstein/edgarWebR")
```
```{r, include=FALSE}
# Cleanup
end_vignette()
```