---
title: 'High Dimensional Data Visualization'
author: "Wayne Oldford and Zehao Xu"
date: "`r Sys.Date()`"
bibliography: references.bib
fontsize: 12pt
link-citations: yes
linkcolor: blue
output:
  rmarkdown::html_vignette:
    toc: true
geometry: margin=.75in
urlcolor: blue
graphics: yes
vignette: >
  %\VignetteIndexEntry{High Dimensional Data Visualization}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
header-includes:
- \usepackage{graphicx}
- \usepackage{epic}
- \usepackage{color}
- \usepackage{hyperref}
- \usepackage{multimedia}
- \PassOptionsToPackage{pdfmark}{hyperref}\RequirePackage{hyperref}
- \newcommand{\code}[1]{\texttt{#1}}
- \newcommand{\ve}[1]{\mathbf{#1}}
- \newcommand{\pop}[1]{\mathcal{#1}}
- \newcommand{\samp}[1]{\mathcal{#1}}
- \newcommand{\subspace}[1]{\mathcal{#1}}
- \newcommand{\sv}[1]{\boldsymbol{#1}}
- \newcommand{\sm}[1]{\boldsymbol{#1}}
- \newcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}
- \newcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert}
- \newcommand{\size}[1]{\left\lvert {#1} \right\rvert}
- \newcommand{\norm}[1]{\left|\left|{#1}\right|\right|}
- \newcommand{\field}[1]{\mathbb{#1}}
- \newcommand{\Reals}{\field{R}}
- \newcommand{\Integers}{\field{Z}}
- \newcommand{\Naturals}{\field{N}}
- \newcommand{\Complex}{\field{C}}
- \newcommand{\Rationals}{\field{Q}}
- \newcommand{\widebar}[1]{\overline{#1}}
- \newcommand{\wig}[1]{\tilde{#1}}
- \newcommand{\bigwig}[1]{\widetilde{#1}}
- \newcommand{\leftgiven}{~\left\lvert~}
- \newcommand{\given}{~\vert~}
- \newcommand{\indep}{\bot\hspace{-.6em}\bot}
- \newcommand{\notindep}{\bot\hspace{-.6em}\bot\hspace{-0.75em}/\hspace{.4em}}
- \newcommand{\depend}{\Join}
- \newcommand{\notdepend}{\Join\hspace{-0.9 em}/\hspace{.4em}}
- \newcommand{\imply}{\Longrightarrow}
- \newcommand{\notimply}{\Longrightarrow \hspace{-1.5em}/ \hspace{0.8em}}
- \newcommand*{\intersect}{\cap}
- \newcommand*{\union}{\cup}
- \DeclareMathOperator*{\argmin}{arg\,min}
- \DeclareMathOperator*{\argmax}{arg\,max}
- \DeclareMathOperator*{\Ave}{Ave\,}
- \newcommand{\permpause}{\pause}
- \newcommand{\suchthat}{~:~}
- \newcommand{\st}{~:~}

---


```{r setup, include=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE,
                      message = FALSE,
                      fig.align = "center", 
                      fig.width = 7, 
                      fig.height = 5,
                      out.width = "60%", 
                      collapse = TRUE,
                      comment = "#>",
                      tidy.opts = list(width.cutoff = 65),
                      tidy = FALSE)
library(knitr)
set.seed(12314159)
imageDirectory <- "./images/highDim"
dataDirectory <- "./data/highDim"
path_concat <- function(path1, ..., sep="/") {
  # The "/" is standard unix directory separator and so will
  # work on Macs and Linux.
  # In windows the separator might have to be sep = "\" or 
  # even sep = "\\" or possibly something else. 
  paste(path1, ..., sep = sep)
}

library(ggplot2, quietly = TRUE)
library(dplyr, quietly = TRUE)
```

## Serialaxes coordinate

Serial axes coordinate is a methodology for visualizing the $p$-dimensional geometry
and multivariate data. As the name suggested, all axes are shown in serial. The axes can be a finite $p$ space or transformed to an infinite space (e.g. Fourier transformation). 

In the finite $p$ space, all axes can be displayed in parallel which is known as the parallel coordinate; also, all axes can be displayed under a polar coordinate that is often known as the radial coordinate or radar plot. In the infinite space, a mathematical transformation is often applied. More details will be explained in the sub-section `Infinite axes`

A point in Euclidean $p$-space $R^p$ is represented as a polyline in serial axes coordinate, it is found that a point <--> line duality is induced in the Euclidean plane $R^2$ [@146402]. 

Before we start, a couple of things should be noticed:

- In the serial axes coordinate system, no `x` or `y` (even `group`) are required; but other aesthetics, such as `colour`, `fill`, `size`, etc, are accommodated. 

- Layer `geom_path` is used to draw the serial lines; layer `geom_histogram`, `geom_quantiles`, and `geom_density` are used to draw the histograms, quantiles (*not `quantile` regression*) and densities. Users can also customize their own layer (i.e. `geom_boxplot`, `geom_violin`, etc) by editing function `add_serialaxes_layers`.

### Finite axes

Suppose we are interested in the data set `iris`. A parallel coordinate chart can be created as followings:

```{r serialaxes}
library(ggmulti)
# parallel axes plot
ggplot(iris, 
       mapping = aes(
         Sepal.Length = Sepal.Length,
         Sepal.Width = Sepal.Width,
         Petal.Length = Petal.Length,
         Petal.Width = Petal.Width,
         colour = factor(Species))) +
  geom_path(alpha = 0.2)  + 
  coord_serialaxes() -> p
p
```

A histogram layer can be displayed by adding layer `geom_histogram`

```{r serialaxes histogram,}
p + 
  geom_histogram(alpha = 0.3, 
                 mapping = aes(fill = factor(Species))) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 0.7))
```

A density layer can be drawn by adding layer `geom_density`

```{r serialaxes density}
p + 
  geom_density(alpha = 0.3, 
               mapping = aes(fill = factor(Species)))
```

A parallel coordinate can be converted to radial coordinate by setting `axes.layout = "radial"` in function `coord_serialaxes`. 

```{r radial, fig.width = 5}
p$coordinates$axes.layout <- "radial"
p
```

<font size="1"> 
Note that: layers, such as `geom_histogram`, `geom_density`, etc, are not implemented in the radial coordinate yet. 
</font>

### Infinite axes

@andrews1972plots plot is a way to project multi-response observations into a function $f(t)$, by defining $f(t)$ as an inner product of the observed values of responses and orthonormal functions in $t$

\[f_{y_i}(t) = <\ve{y}_i, \ve{a}_t>\]

where $\ve{y}_i$ is the $i$th responses and $\ve{a}_t$ is the orthonormal functions under certain interval. Andrew suggests to use the Fourier transformation

\[\ve{a}_t = \{\frac{1}{\sqrt{2}}, \sin(t), \cos(t), \sin(2t), \cos(2t), ...\}\]

which are orthonormal on interval $(-\pi, \pi)$. In other word, we can project a $p$ dimensional space to an infinite $(-\pi, \pi)$ space. The following figure illustrates how to construct an "Andrew's plot".

```{r andrews}
p <- ggplot(iris, 
            mapping = aes(Sepal.Length = Sepal.Length,
                          Sepal.Width = Sepal.Width,
                          Petal.Length = Petal.Length,
                          Petal.Width = Petal.Width,
                          colour = Species)) +
  geom_path(alpha = 0.2, 
            stat = "dotProduct")  + 
  coord_serialaxes()
p
```

A quantile layer can be displayed on top

```{r andrews with quantile}
p + 
 geom_quantiles(stat = "dotProduct",
                quantiles = c(0.25, 0.5, 0.75),
                linewidth = 2,
                linetype = 2) 
```

A couple of things should be noticed:

- mapping aesthetics is used to define the $p$ dimensional space, if not provided, all columns in the dataset 'iris' will be transformed. An alternative way to determine the $p$ dimensional space to set parameter `axes.sequence` in each layer or in `coord_serialaxes`.

- To construct a dot product serial axes plot, say Fourier transformation, "Andrew's plot", we need to set the parameter `stat` in `geom_path` to "dotProduct". The default transformation function is the Andrew's (function `andrews`). Users can customize their own, for example, Tukey suggests the following projected space 

  \[\ve{a}_t = \{\cos(t), \cos(\sqrt{2}t), \cos(\sqrt{3}t), \cos(\sqrt{5}t), ...\}\]
  
  where $t \in [0, k\pi]$ [@gnanadesikan2011methods]. 
  
    ```{r tukey}
    tukey <- function(p = 4, k = 50 * (p - 1), ...) {
      t <- seq(0, p* base::pi, length.out = k)
      seq_k <- seq(p)
      values <- sapply(seq_k,
                       function(i) {
                         if(i == 1) return(cos(t))
                         if(i == 2) return(cos(sqrt(2) * t))
                         Fibonacci <- seq_k[i - 1] + seq_k[i - 2]
                         cos(sqrt(Fibonacci) * t)
                       })
      list(
        vector = t,
        matrix = matrix(values, nrow = p, byrow = TRUE)
      )
    }
    ggplot(iris, 
           mapping = aes(Sepal.Length = Sepal.Length,
                         Sepal.Width = Sepal.Width,
                         Petal.Length = Petal.Length,
                         Petal.Width = Petal.Width,
                         colour = Species)) +
      geom_path(alpha = 0.2, stat = "dotProduct", transform = tukey)  + 
      coord_serialaxes()
    ```
    
  <font size="1"> 
  Note that: Tukey's suggestion, element $\ve{a}_t$ can "cover" more spheres in $p$ dimensional space, but it is not orthonormal. 
  </font>
  
### An alternative way to create a serial axes plot

Rather than calling function `coord_serialaxes`, an alternative way to create a serial axes object is to add a `geom_serialaxes_...` object in our model.

For example, Figure 1 to 4 can be created by calling

```{r geom_serialaxes_ objects, eval = FALSE}
g <- ggplot(iris, 
            mapping = aes(Sepal.Length = Sepal.Length,
                          Sepal.Width = Sepal.Width,
                          Petal.Length = Petal.Length,
                          Petal.Width = Petal.Width,
                          colour = Species))
g + geom_serialaxes(alpha = 0.2)
g + 
  geom_serialaxes(alpha = 0.2) + 
  geom_serialaxes_hist(mapping = aes(fill = Species), alpha = 0.2)
g + 
  geom_serialaxes(alpha = 0.2) + 
  geom_serialaxes_density(mapping = aes(fill = Species), alpha = 0.2)
# radial axes can be created by 
# calling `coord_radial()` 
# this is slightly different, check it out! 
g + 
  geom_serialaxes(alpha = 0.2) + 
  geom_serialaxes(alpha = 0.2) + 
  coord_radial()
```

Figure 5 and 7 can be created by setting "stat" and "transform" in `geom_serialaxes`; to Figure 6, `geom_serialaxes_quantile` can be added to create a serial axes quantile layer.

Some slight difference should be noticed here:

* One benefit of calling `coord_serialaxes` rather than `geom_serialaxes_...` is that `coord_serialaxes` can accommodate duplicated axes in mapping aesthetics (e.g. *Eulerian path*, *Hamiltonian path*, etc). However, in `geom_serialaxes_...`, duplicated axes will be omitted. 

* Meaningful axes labels in `coord_serialaxes` can be created automatically, while in `geom_serialaxes_...`, users have to set axes labels by  `ggplot2::scale_x_continuous` or  `ggplot2::scale_y_continuous` manually.

* As we turn the serial axes into interactive graphics (via package [loon.ggplot](https://great-northern-diver.github.io/loon.ggplot/)), serial axes lines in `coord_serialaxes()` could be turned as interactive but in `geom_serialaxes_...` all objects are static. 

```{r benefits of coord_serialaxes, eval=FALSE}
# The serial axes is `Sepal.Length`, `Sepal.Width`, `Sepal.Length`
# With meaningful labels
ggplot(iris, 
       mapping = aes(Sepal.Length = Sepal.Length,
                     Sepal.Width = Sepal.Width,
                     Sepal.Length = Sepal.Length)) + 
  geom_path() + 
  coord_serialaxes()

# The serial axes is `Sepal.Length`, `Sepal.Length`
# No meaningful labels
ggplot(iris, 
       mapping = aes(Sepal.Length = Sepal.Length,
                     Sepal.Width = Sepal.Width,
                     Sepal.Length = Sepal.Length)) + 
  geom_serialaxes()
```

Also, if the dimension of data is large, typing each variate in mapping aesthetics is such a headache. Parameter `axes.sequence` is provided to determine the axes. For example, a `serialaxes` object can be created as

```{r axes.sequence, eval=FALSE}
ggplot(iris) + 
  geom_path() + 
  coord_serialaxes(axes.sequence = colnames(iris)[-5])
```

At very end, please report bugs [here](https://github.com/z267xu/ggmulti/issues). Enjoy the high dimensional visualization! "Don't panic... Just do it in 'serial'" [@inselberg1999don].

## Reference