---
title: "Tutorial on the snha package"
shorttitle: "snha package tutorial"
author: Detlef Groth, University of Potsdam, Germany
date: 2023-03-05
abstract: >
    The *snha* package provides easy to use R functions to apply the St.
    Nicolas House Analysis to your data. The algorithm traces associations chains between
    interactiving variables. The algorithm was described recently by Groth et.
    al. (2019) @Groth2019 and more detailed by Hermanussen et. al. (2021) @Hermanussen2021.
    In this package vignette the basic workflow for analyzing your and raw data and as well for 
    analysing precomputed correlation matrices is demonstrated.
bibliography: bibliography.bib
csl: nature-biotechnology.csl

output: function () { rmarkdown::html_vignette(toc=TRUE,css="style.css") }
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Tutorial on the snha package}
  %\usepackage[utf8]{inputenc}
---

## Introduction

The package `snha` explores interacting variables by searching association
chains where correlation coefficients between variables drop in a regular order
between a set of variables. The package can be used by calling the function
`snha` with your data, where the columns must be your variables. The return
value is an object of class `snha` which can be visualized using a plot
function. The details of the analysis can be inspected by looking at the
internal variables of this object. Below follows a minimal analysis for the `birthwt`
data from the _MASS_ R package. The variables are:

* _age_ - mother's age in years.
* _lwt_ - mother's weight in pounds at last menstrual period.
* _race_ - mother's race (1 = white, 2 = black, 3 = other).
* _smoke_ - smoking status during pregnancy (0 = no, 1 = yes).
* _ptl_ - number of previous premature labours.
* _ht_ - history of hypertension (0 = no, 1 = yes).
* _ui_ - presence of uterine irritability (0 = no, 1 = yes).
* _ftv_ - number of physician visits during the first trimester.
* _bwt_ - birth weight of child in grams.

Let's start with the data preparation. For illustrative purposes we add
a random data vector as well:

```{r label=ddataprep}
set.seed(125)
# retrieve the data
library(MASS)
data(birthwt)
birthwt$low=NULL 
# remove column for the low indicator 
# which is 1 i a child has low birtwt
rnd=round(rnorm(nrow(birthwt),mean=10,sd=2),2)
# rnd just contains random data
birthwt=cbind(birthwt,rnd=rnd) # adding it
head(birthwt)
```

OK, we are ready to go: We added the random data column `rnd` and removed the
redundant column `low` which indicated low birth weight, for this we have 
the `bwt` column in the data set. So we do not need a redundant variable.

Let's now first start for illustrative purposes with a PCA and then with our
SNHA method where we use Spearman correlation as it is more robust against
outliers than Pearson correlation, we set the p-value threshold, _alpha_ to 0.1
as the algorithm is very resistant against the detection of spurious
correlations.

```{r label=plotsnha, fig.width=10,fig.height=5,fig.cap="Birth weight data variable interactions"}
opar=par(mfrow=c(1,2),mai=c(0.8,0.8,0.1,0.2))
library(snha)
# retrieve some data
pca=prcomp(t(scale(birthwt)))
summary(pca)
plot(pca$x[,1:2],xlab='PC1', ylab='PC2',pch=19,cex=5,col='salmon')
text(pca$x[,1:2],colnames(birthwt))
text(-5,-10,"PCA",cex=2) 
as=snha(birthwt,method="spearman",alpha=0.1)
par(mai=c(0.8,0.2,0.1,0.2))
plot(as,layout="sam",vertex.size=7,lwd=3,edge.width=3)
text(-1.5,-1.8,"SNHA",cex=2)
box()
par(opar)
```

In the PCA plot on the left, the most import variables, having high values in
the first component, are on the left and right borders of the plot, unimportant
variables are in the center, negatively associated deeply interacting variables
such as birthweight (bwt) and premature labours (ptl) are on opposite sides of
the plot. These characteristics of the PCA plot make it hart to follow the
variable relations.  In contrast the variables in the SNHA graph on the right
show immediately logical interactions, the birth weight is positively
associated to mothers last weight, and negatively to smoking, premature labours
and uterine irritability, white people smoke more and white mothers visit more
often physicians ... The older the mother the more visits at physicians and
hypertension is positively associated with weight of the mother.

What are the R-square values, the prediction power for every node based on linear
models and what are the connections between the variables stored in the
adjacency matrix 'theta':

```{r}
round(snha_rsquare(as),2)
as$theta
```

It can be seen that the overall strength of the association is very small,
largest r-square value is 0.18 for smoke, 0.13 for birthweight (bwt) but still
the analysis show reasonable results without having the necessity of finding
some optimal threshold.


## Decathlon data

Here is an other example where we analyze the relationship between the different
decathlon disciplines with athletes taking part in the 1988 Olympics and which had
results above 7000 points. We perform the St. Nicolas House Analysis and
later check the average R-square values for the each node.

```{r label=dec,fig.width=9,fig.height=6,fig.cap="SNHA - Decathlon Data 1988"}
### data loading
data(decathlon88)
head(decathlon88)
A=snha(decathlon88,method="spearman",alpha=0.1)
cols=rep("salmon",10)
cols[names(A$data) %in% c("jave","shot","disc","pole")]="skyblue"
plot(A,layout="sam",vertex.color=cols,vertex.size=8,cex=1.1,edge.width=5)
snha_rsquare(A)
mn=mean(snha_rsquare(A))
title(paste("R-square = ",round(mn,2)))
```

As you can see the variables nicely separates between disciplines related to
the upper part of the body (blue) and disciplines where the legs do most of the
work (salmon). The mostly hated 1500m run is negatively associated to the
throwing disciplines. The running distances are building a chain
_100-400-1500m_ as expected and the jump disciplines are close to each other
high-jump (high), hurdles (X110), long jump (long) and pole. As you can see the
variables are just in their logical order. 

```{r}
round(A$sigma,2)
round(A$p.value,3)
```

For illustrative purposes create a graph with the same layout but with edges
showing all significant correlations.

```{r label=dec2,fig.width=9,fig.height=6,fig.cap="Decathlon Data 1988 (p-value Graph)"}
B = A$theta
B[]=0
B[A$p.value<0.05]=1
diag(B)=0
plot.snha(B,layout='sam',vertex.color=cols,vertex.size=8,cex=1.1,edge.width=5)
```

As you can see the major relationships are the same, but there are a few more
edges which did however not enhance the overall data structure. In case of
really interacting variables it would be as well difficult to distinguish
between direct and indirect associations, as the latter can be as well very
easily become significant if the primary interaction is highly significant.

## Swiss dataset example

Let's finish with an other data set, the `swiss` data which are available in
every R installation. Here we try out both the correlation methods, Spearman
and Pearson correlation. We use the function `snha_layout` to determine
a layout matrix which we will then reuse for both plots.

```{r label=plot,fig.width=10,fig.height=5,fig.cap="Swiss data variable associations"}
library(snha)
data(swiss)
head(swiss,4)
### shorter names useful for display later in the graph
colnames(swiss)=abbreviate(colnames(swiss))
head(swiss,4)
opar=par(mfrow=c(1,2))
### options(warn=-1)
as=snha(swiss,method="pearson")
### store layout for reuse in two graphs
lay = snha_layout(as,mode="sam")
plot(as,layout=lay,vertex.size=8,main="Pearson")
as=snha(swiss,method="spearman")
plot(as,layout=lay,vertex.size=8,main="Spearman")
par(opar)
```

Here is the resulting adjacency matrix:

```{r label=theta,results='asis'}
knitr::kable(as$theta)
```

As you can see the structure remains the same, but Pearson correlation shows
more edges, we should check if the data are normally distributed.  Again,
without playing around with some parameters or thresholds we get immediately
the general associations between the data.  Let's just check if the data are
normally distributed and then conclude if we should use Spearman correlation
for non-normally distributed data or Pearson correlation for normally
distributed data:

```{r, result="asis"}
### prepare a test returning only p-values
mtest = function (x) { return(shapiro.test(x)$p.value)  }
df=data.frame(orig=round(apply(swiss,2,mtest),3))
df=cbind(df,log2=round(apply(log2(swiss),2,mtest),3))
knitr::kable(df)
```

As you can see, both with the original data and as well with the log-normalized
data the Shapiro-Wilk test has a few significant entries, so we reject the
Null-hypothesis that these data are coming from a normal distribution. So for
our example using the `swiss` data we should very likely prefer using the
Spearman correlation. 

## Plotting

The plotting of the graph can be changed in various ways, for details see
`?plot.snha`. Here I just give a few examples. As the graph is generated based
on the underlying pairwise correlations, it might be useful to display the
pairwise correlation either in a correlation plot or by adding the correlations
values on the edges of the graph. Here an example where we do first
a correlation plot and then a plot of the SNHA graph overlaying the edges with the correlation values.

```{r label=corrplot,fig.width=10,fig.height=5,fig.cap="Correlation and Network plot with correlation values on the edges"}
opar=par(mfrow=c(1,2),mai=rep(0.2,4))
sw=snha(swiss,method="spearman",alpha=0.1)
plot(sw,type="corrplot")
plot(as,edge.text=round(as$sigma,2),edge.pch=15,layout='sam')
par(opar)
```

## Log-Likelihood

The edge quality can be judged either by the log-likelihood ratio for the
individual chains or by bootstrapping where we look how often a certain chain
was found if we do re-samplings with our data set.

Let's first  calculate the log-likelihoods for the different chains which were
found. We can see the underlying chains either directly using the internal
object chains or by using the `snha_get_chains` method which returns a data
frame:

```{r label=chains}
snha_get_chains(as)
```  

The `m` in front of a chain name indicated that the chain was found by
investigating the variable to be in the middle of a chain, the `a` indicated
that the chain was at the beginning of the investigated chain. For details on
the algorithm have a look at Hermanussen et. al. (2021[@Hermanussen2021]).

The log-likelihood for these chains can be calculated using the function
`snha_ll` like this:

```{r label=ll}
snha_ll(as)
```

The relevant p-values are in the last column, if the p-value is higher than
0.05 we can assume that the chain is sufficient to capture the dependency
between the variables of the chain. Here for the chain 2 and 3 this is the
case, whereas for the first chain the p-value is very low indicating that the
chain is not sufficient to capture the variable dependencies. One reason might
be that we used Spearman correlation to create the graph whereas log-likelihood assumes linear dependencies.

## Bootstrapping

Another approach to evaluate the quality of chains and edges is bootstrapping.
We sample several times items from the data set with replacement and we redo
thereafter the analysis with each of the samples. Edges which appear only very
rarely are less likely to be of importance and significance. 

Let's use an example:

```{r label=boot,fig.width=9,fig.height=4,fig.cap="Boostrap Example"}
opar=par(mfrow=c(1,2),mai=c(0.1,0.1,0.7,0.1))
as.boot=snha(swiss,method="spearman",prob=TRUE)
lay=snha_layout(as.boot,method="sam")
plot(as,layout=lay,vertex.size=6,main="Single Run")
plot(as.boot,layout=lay,vertex.size=6,main="Bootstrap Run")
par(opar)
```

Solid lines shown in the graph above indicate that edges where found in more
than 75 percent of all re-samplings, broken lines indicate edges appearing in
more than 50% of all re-samplings and dotted lines in 25-50% of all
re-samplings. 

As you can see the bootstrap method does find a few more edges than the single
run variation of the `snha` method. If you network is not too large it is
usually recommended to use bootstrapping to get more insights into the edge
quality and to get as well edges if the network is more dense and has a lot of
highly connected nodes.

## Creating your own data

In order to test the algorithm there is as well in the package a function which
allows you to generate data for directed and undirected graphs, either using
the given adjacency matrix as precision matrix or using a Monte Carlo
simulation as described by Novine et. al (2021 @Novine21). Here an example:

```{r werner}
W=matrix(0,nrow=6,ncol=6,dimnames=list(LETTERS[1:6],LETTERS[1:6]))
W[1:2,3]=1
W[3,4]=1
W[4,5:6]=1
W[5,6]=1
W
```

For such an adjacency matrix we can create data like this:

```{r}
data=snha_graph2data(W)
dim(data)
round(cor(t(data)),2)
```

As you can see the correlations follow the given graph, we can as well plot
these for better illustration:

```{r label=wplot,fig.width=8,fig.height=3,out.width=900,fig.cap="True graph, correlations and predicted graph (left to right)"}
opar=par(mfrow=c(1,3),mai=rep(0.2,4))
plot.snha(W)
plot.snha(cor(t(data)),type="cor")
plot.snha(snha(t(data)))
par(opar)
```


## Installation

As long as the package is not yet on the CRAN repository the package can be
usually installed using the submitted `tar.gz` archive with the following
commands:

```
library(tcltk)
pkgname=tclvalue(tkgetOpenFile(
    filetypes="{{Tar.gz files} {*.tar.gz}} {{All files} {*.*}}"))
if (pkgname != "") {
    install.packages(pkgname,repos=NULL)
}
```

It is as well possible to install the latest version directly from the Github
repository like this:

```
library(remotes)
remotes::install_github("https://github.com/mittelmark/snha")

```

Thereafter you can check the installation like this:

```
library(snha)
citation("snha")
```

## Background Details and Concept

Analyzing multivariate data is often done using visualization of pairwise
correlations, using principal component analysis or multidimensional scaling as
typical methods in this area. The `snha` package provides an alternative
approach, by uncovering ordered sequences of correlation coefficients which
can be reversed [@Groth2019] [@Hermanussen2021]. Existing chains are translated
into edges between the variables, here taken as nodes of a graph. The graph can
be then visualized and the major relations between the variables are visible.

The basic assumption of the method is the assumption that correlations coefficients between two variables, where one variable directly influences the other,
are larger than those of secondary associations. So for instance if we assume
that a variable _A_ influences a variable _B_, and _B_ influences _C_, it can be
assumed, that _r(AB)_ > _r(AC)_ and that in the opposite direction _r(CB)_ > _r(CA)_.

The algorithm provided in the `snha` package uncovers such association chains
where the order of correlation coefficient can be reversed. The advantage of
the method is that there is only a very limited requirement for choosing
thresholds for instance for the p-value or for the correlation coefficient.
The reason is that the existence of such association chains with the correct
ordering of three or more nodes is much less likely to exists by accident then
significant pairwise correlations.

In the following we will first illustrate the concept on a simple hypothetical
association chain and thereafter you might again study the real world examples
at the beginning of this vignette with more understanding.

## Simple association chain

Let's assume we have a simple association chain where a variable _A_ is
influencing a variable _B_, _B_ is influencing a variable _C_ and _C_ is influencing
variable _D_ like this: 

```{r label=start,fig.width=6,fig.height=1.7,fig.cap="An association chain"}
opar=par(mai=c(0.1,0.1,0.1,0.0))
plot(1,xlab="",ylab="",axes=FALSE,type="n",xlim=c(0.5,4.5),ylim=c(0.8,1.2))
arrows(1:3,rep(1,3),1:3+0.8,rep(1,3),lwd=3,length=0.1)
points(1:4,rep(1,4),pch=19,col="salmon",cex=6)
text(1:4,1,LETTERS[1:4],cex=2)
par(opar)
```

In this situation we can assume that, despite of the omnipresent noise in such
situation, the correlations of directly interacting variables is higher in
comparison to variables only connected only via other variables. Let's assume
for simplicity reasons, that the correlation between directly connected
variables drops down from r=0.7 to around r=0.5 for secondary connected
variables and r=0.3 for tertiary connected variables. So a possible correlation
matrix could look like this:


```{r result='asis'}
C=matrix(c(1,0.7,0.5,0.3,
           0.7,1,0.7,0.5,
           0.5,0.7,1,0.7,
           0.3,0.5,0.7,1),
           nrow=4,byrow=TRUE)
rownames(C)=colnames(C)=LETTERS[1:4]           
knitr::kable(C)
```

Let's now add a little bit of noise and visualize the pairwise correlations
using the plot function of the `snha` package.

```{r label=corplot,fig.width=6,fig.height=3,fig.cap="Visualization of correlation matrix sigma and the adjacency matrix theta"}
set.seed(123)
opar=par(mfrow=c(1,2),mai=c(0.1,0.1,0.1,0.1))
C=C+rnorm(length(C),mean=0,sd=0.1)
C[lower.tri(C)]=t(C)[lower.tri(C)]
diag(C)=1
as=snha(C)
round(as$sigma,3)
plot(as,type="corplot")
as$theta
plot(as)
par(opar)
```

As we can see, the correlations are now slightly altered. A simple
*r* threshold mechanism, for instance taking only correlations larger than 0.5
into consideration would as well have false positive edges like between the nodes B and D. The
function `snha` takes as input either a correlation matrix or a data matrix
or data.frame and tries to find such association chains. The association chain
is stored in the internal object theta and can be visualized using the default
plot command.


## Summary

Here are the functions to be used by the normal user of the package:

* _snha_ - create a snha graph object
* _plot_ - plot a snha graph object
* _as.list_ - create a list out of a snha graph object, ready to write for
  instance into an Excel file  
* *snha_get_chains* - get the actual chains which were found and which build the graph
* *snha_graph2data* - generate for a given adjacency matrix some data    
* *snha_rsquare* - get r-square values for the nodes based on linear model to have 
  a qualitative measure for the graph prediction. 


The snha graph object contains a few internal variables which might be of interest for the user:

* _alpha_ - the chosen p-value threshold
* _chains_ - the found association chains
* _data_ - the input data
* _method_ - the correlation method
* _p-values_ the pairwise p-values
* _probabilities_ - in case of bootstrapping the proportion how often a chain was found
* _theta_ - the adjacency matrix for the nodes / variables

## Build information

The package was build using `r R.version.string` on `r  R.version$platform` using snha package `r packageVersion('snha')`.

```{r}
print(sessionInfo())
```

## References