\input{head.tex}

%\VignetteIndexEntry{Introduction to streamMOA}

<<echo=FALSE>>=
options(width = 75, digits = 3, prompt = 'R> ')
@

\section{Introduction}
Please refer to the vignette in package~\pkg{stream} for an introduction
to data stream mining in \proglang{R}. In this vignette we give two examples
that show how to use the \pkg{stream} framework being used from start to finish.
The examples encompasses the creation of data streams, preparation of
data stream clustering algorithms, the online
clustering of data points into micro-clusters, reclustering and finally
evaluation.
The first example shows how compare a set of data stream clustering
algorithms on a static data set. The second example shows how to perform
evaluation on a data stream with concept drift (clusters evolve over time).

\section{Experimental Comparison on Static Data} \label{examples:full}

First, we set up a static data set.
We extract 1500 data points from the Bars and Gaussians data stream generator
with 5\% noise and put them in a \code{DSD_Memory}. The wrapper is used
to replay the same part of the data stream for each algorithm.
We will use the first
1000 points to learn the clustering and the remaining 500 points for
evaluation.

<<echo=FALSE>>=
set.seed(1234)
@

<<data_bng, fig=TRUE, include=FALSE>>=
library("streamMOA")
@


<<data_bng, fig=TRUE, include=FALSE>>=
stream <- DSD_BarsAndGaussians(noise=0.05) %>% DSD_Memory(n = 5500)
stream
plot(stream)
@

\begin{figure}
\centering
\includegraphics[width=.5\linewidth]{streamMOA-data_bng}
\caption{Bar and Gaussians data set.}
\label{figure:data_bng}
\end{figure}

Figure~\ref{figure:data_bng} shows the structure of the data set.
It consists of four clusters,
two Gaussians and two uniformly filled rectangular clusters. The Gaussian and the
bar to the right have $1/3$ the density of the other two clusters.

We initialize a k-means on a sample and multiple clusterers from \pkg{streamMOA}.
We choose the parameters
experimentally so that the algorithm produce each (approximately) 100 micro-clusters.

<<>>=
algorithms <- list(
  'Sample + k-means' = DSC_TwoStage(micro = DSC_Sample(k = 100),
                                    macro = DSC_Kmeans(k = 4)),
  'DenStream'     = DSC_DenStream(epsilon = .5, mu = 1),
  'cluStream'     = DSC_CluStream(m = 100, k = 4),
  'Bico'          = DSC_BICO_MOA(Cluster = 4, Dimensions = 2, MaxClusterFeatures = 100)
)
@

We store the algorithms in a list for easier handling and then cluster the same
1000 data points with each algorithm. Note that we have to reset the stream
each time before we cluster.

<<>>=
for (a in algorithms) {
  reset_stream(stream)
  update(a, stream, 1000)
}
@

We use \code{nclusters()} to inspect the number of micro-clusters.
<<>>=
sapply(algorithms, nclusters, type = "micro")
@

All algorithms except DenStream produce around 100 micro-clusters.
We were not able to adjust DenStream to produce more than around 50 micro-clusters
for this data set.

To inspect micro-cluster placement, we plot the calculated micro-clusters and the
original data.

<<microclusters, fig=TRUE, include=FALSE, width=8, height=12>>=
op <- par(no.readonly = TRUE)
layout(mat = matrix(1:4, ncol = 2))
for (a in algorithms) {
  reset_stream(stream)
  plot(a, stream, main = description(a), type = "micro")
}
par(op)
@

\begin{figure}
\centering
\includegraphics{streamMOA-microclusters}
\caption{Micro-cluster placement for different data stream clustering
algorithms.}
\label{figure:microclusters}
\end{figure}

Figure~\ref{figure:microclusters} shows the micro-cluster placement by
the different algorithms. Micro-clusters are shown as red circles and
the size is proportional to each cluster's weight.
Reservoir sampling and the sliding window randomly place the micro-clusters
and also a few noise points (shown as grey dots).
Clustream and BICO also do not suppress noise and
places even more micro-clusters on noise points since it tries to
represent all data as faithfully as possible. DenStream
suppresses noise and concentrate the micro-clusters on the real clusters. DenStream
produces one heavy micro-cluster on one cluster, while using a large number of
micro clusters for the others. It also has problems with detecting the
rectangular low-density cluster.

It is also interesting to compare the assignment areas for micro-clusters
created by different algorithms. The assignment area is the area around
the center of a micro-cluster in which points are considered to belong to
the micro-cluster. In case that a point is in the assignment area of several
micro-clusters, the closer center is chosen.
To show the assignment area we add \code{assignment = TRUE} to plot.
We also disable showing micro-cluster weights to make the plot clearer.

<<microclusters_assignment, fig=TRUE, include=FALSE, width=8, height=12>>=
op <- par(no.readonly = TRUE)
layout(mat = matrix(1:4, ncol = 2))
for (a in algorithms) {
  reset_stream(stream)
  plot(
    a,
    stream,
    main = description(a),
    assignment = TRUE,
    weight = FALSE,
    type = "micro"
  )
}
par(op)
@

\begin{figure}
\centering
\includegraphics{streamMOA-microclusters_assignment}
\caption{Micro-cluster assignment areas for different data stream clustering
algorithms.}
\label{figure:microclusters_assignment}
\end{figure}

Figure~\ref{figure:microclusters_assignment} shows the assignment areas as dotted
circles around micro-clusters. Not all algorithms provide assignment ares.

<<>>=
sapply(
  algorithms,
  FUN = function(a) {
    reset_stream(stream, 1001)
    evaluate_static(
      a,
      stream,
      measure = c("numMicroClusters", "purity", "SSQ", "silhouette"),
      n = 500,
      assignmentMethod = "auto",
      type = "micro"
    )
})
@

We need to be careful with the comparison of these numbers, since the depend
heavily on the number of micro-clusters with more clusters leading to a
better value. Therefore, a comparison with DenStream
is not valid.
We can compare the measures, of the other algorithms since the number
of micro-clusters is close.
Sampling produces very good values
for purity, BICO achieves the highest average
silhouette coefficient and CluStream produces the lowest sum of squares.
For better results more data and cross-validation could be used.

\section{Outlier detection}

To support the outlier detection area, \pkg{streamMOA} contains a wrapper to the MOA
implementation of the Micro-cluster Continuous Outlier Detector (MCOD). To demonstrate
a synergy of outlier detection capabilities between \pkg{stream} and \pkg{streamMOA}
packages, we bring two basic examples.
First, we create a fixed data stream with noise outliers that are well separated from the
clusters using Mahalanobis distance.
<<outlier1>>=
library(stream)
set.seed(1000)
stream <- DSD_Gaussians(k = 3, d = 2,
            variance_limit = c(0.1, 1),
            space_limit = c(0, 30),
            noise = .01,
            noise_limit = c(0, 30),
            noise_separation = 6,
            separation_type = "Mahalanobis"
  ) %>% DSD_Memory(n = 1000)
@

<<outlier2, fig=TRUE, include=FALSE>>=
plot(stream, n = 1000)
@
The generated stream can be seen in Figure~\ref{figure:outlier_points}.
\begin{figure}
\centering
\includegraphics[width=.5\linewidth]{streamMOA-outlier2}
\caption{Data points from \code{DSD\_Gaussians} having 3 clusters and 1\% is outliers (noise).}
\label{figure:outlier_points}
\end{figure}

Then we define a \code{DSC_MCOD} clusterer. Since this is a single-pass clusterer
\code{DSC_SinglePass}, we do not need to update the model first, we can immediately
call evaluation.
<<outlier3>>=
reset_stream(stream)
mic_c <- DSOutlier_MCOD(r = 2, w = 1000)
evaluate_static(
  mic_c,
  stream,
  n = 1000,
  type = "micro",
  measure = c("crand", "outlierjaccard")
)
@

<<outlier4, fig=TRUE, include=FALSE>>=
reset_stream(stream)
plot(mic_c, stream, n = 1000)
@

\begin{figure}
\centering
\includegraphics[width=.45\linewidth]{streamMOA-outlier4}
\caption{MCOD outlier detection.}
\label{figure:outlier_mcod1}
\end{figure}

In Figure~\ref{figure:outlier_mcod1} we can see micro-cluster and outlier assignments
for the generated data stream.

\input{foot.tex}
