---
title: "Introducing edgarWebR"
output: rmarkdown::html_vignette
date: 2017-08-02
author: "Micah J Waldstein <micah@waldste.in>"
vignette: >
  %\VignetteIndexEntry{Introducing edgarWebR}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
projects: ['edgarWebR']
categories: ['RStats']
type: post
draft: false
---
<!-- ```{r setup, echo = FALSE, message = FALSE} -->
```{r setup}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(edgarWebR)
library(dplyr, quietly=TRUE)
library(purrr, quietly=TRUE)
library(ggplot2)
set.seed(0451)
# Cache http requests
library(httptest)
start_vignette("intro")
```
There are plenty of R packages for fetching and manipulating companies'
financial data, often pulling it directly from public filings with the SEC. All
of these packages aim to get at the XBRL data, which contains the financial
statements, typically in annual (10-K) or quarterly (10-Q) filings.

SEC filings, however, contain far more information. edgarWebR is the first step
in accessing that data: it provides an interface to the SEC EDGAR search tools
and the metadata they return.

## Current Features
 * Search for companies and mutual funds.
 * List filings for a company or mutual fund.
 * Get all information associated with a particular filing.

## Simple Use Case
Using information about filings, we can use edgarWebR to see how long after the
close of a financial period it takes for a company to make a filing. We can
also see how that time has changed.

### Get Data
First, we'll fetch up to 100 filings whose form type starts with `10-` for our
chosen company using `company_filings()`. To make sure we're going far enough
back, we'll take a peek at the tail of our results.
```{r companyInfo}
ticker <- "EA"

filings <- company_filings(ticker, type = "10-", count = 100)
initial_count <- nrow(filings)
# Specifying the type provides all forms that start with 10-, so we need to
# manually filter.
filings <- filings[filings$type == "10-K" | filings$type == "10-Q", ]
```

Note that explicitly filtering caused us to go from `r initial_count` to
`r nrow(filings)` rows.
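
The manual filter is needed because other form types, such as amended filings,
share the `10-` prefix. Here is a minimal base R sketch of the same filter on a
made-up data frame. The `toy` rows and the `keep_types` vector are illustrative
assumptions, not real EDGAR results. Using `%in%` makes it easy to add further
form types later.

```r
# Made-up rows standing in for the result of company_filings(); only the
# `type` column matters for this sketch.
toy <- data.frame(
  type = c("10-K", "10-Q", "10-K/A", "10-Q/A", "10-K405"),
  filing_date = as.Date(c("2017-05-24", "2017-02-07", "2016-06-01",
                          "2015-11-10", "2001-06-29")),
  stringsAsFactors = FALSE
)

# Keep only exact matches; amended forms and other variants are dropped.
keep_types <- c("10-K", "10-Q")
toy_filtered <- toy[toy$type %in% keep_types, ]

nrow(toy)           # 5 rows before filtering
nrow(toy_filtered)  # 2 rows after filtering
```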

```{r}
filings$md_href <- paste0("[Link](", filings$href, ")")
knitr::kable(tail(filings)[, c("type", "filing_date", "accession_number", "size",
                               "md_href")],
             col.names = c("Type", "Filing Date", "Accession No.", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark = ","))
```

We've received filings dating back to 1995, which seems good enough for our
purposes, so next we'll get the detailed filing information for each one.

So far we've done everything in base R, but now we'll use some useful functions
from dplyr and purrr to make things a bit easier.
```{r filingInfo}
# this can take a while - we're fetching ~100 html files!
filing_infos <- map_df(filings$href, filing_information)

filings <- bind_cols(
                     filings[, !(names(filings) %in% names(filing_infos))],
                     filing_infos)
filings$filing_delay <- filings$filing_date - filings$period_date

# Take a peek at the data
knitr::kable(head(filings) %>% select(type, filing_date, period_date,
                                      filing_delay, documents, bytes) %>%
             mutate(filing_delay = as.numeric(filing_delay)),
             col.names = c("Type", "Filing Date", "Period Date", "Delay",
                           "Documents", "Size (B)"),
             digits = 2,
             format.args = list(big.mark = ","))
```
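
The chunk above does two things that are easy to miss: it drops the columns
both data frames share before binding them, and it computes the delay by
subtracting one date column from another. The sketch below isolates those two
steps in base R. The `listing` and `details` data frames are hypothetical
stand-ins with invented values, and `cbind()` is used here in place of
`bind_cols()`.

```r
# Made-up stand-ins: `listing` plays the role of `filings`, `details` the
# role of the per-filing results. All values are illustrative.
listing <- data.frame(
  type = c("10-K", "10-Q"),
  filing_date = as.Date(c("2017-05-24", "2017-02-07")),
  stringsAsFactors = FALSE
)
details <- data.frame(
  type = c("10-K", "10-Q"),
  period_date = as.Date(c("2017-03-31", "2016-12-31")),
  documents = c(12L, 8L),
  stringsAsFactors = FALSE
)

# Drop columns from `listing` that `details` also provides, so the combined
# data frame has no duplicated names. drop = FALSE keeps a data frame even
# when only a single column remains.
shared <- names(listing) %in% names(details)
combined <- cbind(listing[, !shared, drop = FALSE], details)

# Subtracting two Date columns yields a difftime measured in days.
combined$filing_delay <- combined$filing_date - combined$period_date
as.numeric(combined$filing_delay)  # 54 38
```

Converting the `difftime` with `as.numeric()` before tabulating or averaging
avoids carrying the unit label through later calculations.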

### Basic Analysis
Now that our data is arranged, we can run some summary statistics and plot the
results using ggplot2.
```{r filingAnalysis}
knitr::kable(filings %>%
             group_by(type) %>% summarize(
               n = n(),
               avg_delay = as.numeric(mean(filing_delay)),
               median_delay = as.numeric(median(filing_delay)),
               avg_size = mean(bytes / 1024),
               avg_docs = mean(documents)
             ),
             col.names = c("Type", "Count", "Avg. Delay", "Median Delay",
                           "Avg. Size", "Avg. Docs"),
             digits = 2,
             format.args = list(big.mark = ","))
```

No surprise, yearly filings (10-K's) are larger and take more time than
quarterly filings (10-Q's). Let's see how the times are distributed:

```{r plotDelay, fig.width=6}
ggplot(filings, aes(x = factor(type), y = filing_delay)) +
  geom_violin() + geom_jitter(height = 0, width = 0.1) +
  labs(x = "Filing Type", y = "Filing delay (days)")
```

We can also examine how the filing delay has changed over time:
```{r plotType, fig.width=6}
ggplot(filings, aes(x = filing_date, y = filing_delay, group = type, color = type)) +
  geom_point() + geom_line() +
  labs(x = "Filing Date", y = "Filing delay (days)")
```

Whether because of some internal change or a change to SEC rules, the time
between the end of the fiscal year and the 10-K has dropped quite a bit, though
there is still some variation.
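
One rough way to put a number on that drop is to compare the median delay
before and after some cutoff year. The sketch below does this in base R on
synthetic data. The delays, the `cutoff` value and the variable names are all
invented for illustration and say nothing about the real filings. With the
actual data, the year could be derived from `filing_date`.

```r
# Synthetic 10-K delays (in days) by filing year; the numbers are invented
# purely to illustrate the computation.
toy <- data.frame(
  year = c(1998, 1999, 2000, 2001, 2010, 2011, 2012, 2013),
  delay = c(88, 90, 85, 89, 58, 55, 60, 57)
)

# Split at an arbitrary cutoff year and compare the median delay in each era.
cutoff <- 2005
era <- ifelse(toy$year < cutoff, "before", "after")
median_by_era <- tapply(toy$delay, era, median)

# Difference between the two medians, in days.
delay_drop <- as.numeric(median_by_era["before"] - median_by_era["after"])
delay_drop  # 31
```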

An interesting extension would be to compare the filing delays to the company's
stock price, overall market performance, and other companies' delays to see if
there are particular drivers of the pattern.

```{r plotSize, fig.width=6}
ggplot(filings, aes(x = filing_date, y = bytes / 1024, group = type, color = type)) +
  geom_point() + geom_line() +
  labs(x = "Filing Date", y = "Filing Size (KB)")
```

The jump in size around 2010 coincides with the start of the requirement to
include financial data files, but there are still interesting variations.

## More to come
SEC filings contain a wealth of information, and we're only scratching the
surface. edgarWebR itself has a lot more functionality than what we've explored
here, and there is more coming.

## How to Download
edgarWebR is available from CRAN, so it can be installed with:
```{r eval=FALSE}
install.packages("edgarWebR")
```

If you want the latest and greatest, you can get a copy of the development version
from GitHub by using devtools:
```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("mwaldstein/edgarWebR")
```
```{r, include=FALSE}
# Cleanup
end_vignette()
```