---
title: "basic usage"
author: "Aviezer Lifshitz"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Basic usage of the package.


## Basic usage

```{r, echo = FALSE, message = FALSE}
library(dplyr)
library(ggplot2)
library(tglkmeans)
theme_set(theme_classic())
set.seed(60427)
```

First, let's create 5 clusters normally distributed around 1 to 5, with sd of 0.3:

```{r}
data <- simulate_data(n = 100, sd = 0.3, nclust = 5, dims = 2)
data
```

This is how our data looks like:

```{r, fig.show='hold'}
data %>% ggplot(aes(x = V1, y = V2, color = factor(true_clust))) +
    geom_point() +
    scale_color_discrete(name = "true cluster")
```

Now we can cluster it using kmeans++:

```{r}
rownames(data) <- data$id
data_for_clust <- data %>% select(starts_with("V"))
km <- TGL_kmeans_tidy(data_for_clust,
    k = 5,
    metric = "euclid",
    verbose = TRUE
)
```
The returned list contains 3 fields:

```{r}
names(km)
```

`km$centers` contains a tibble with `clust` column and the cluster centers:

```{r}
km$centers
```
clusters are numbered according to `order_func` (see 'Custom cluster ordering' section).

`km$cluster` contains tibble with `id` column with the observation id (`1:n` if no id column was supplied), and `clust` column with the observation assigned cluster:

```{r}
km$cluster
```

`km$size` contains tibble with `clust` column and `n` column with the number of points in each cluster:

```{r}
km$size
```

We can now check our clustering performance - fraction of observations that were classified correctly:

```{r}
d <- tglkmeans::match_clusters(data, km, 5)
sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust))
```

And plot the results:

```{r, fig.show='hold'}
d %>% ggplot(aes(x = V1, y = V2, color = factor(new_clust), shape = factor(true_clust))) +
    geom_point() +
    scale_color_discrete(name = "cluster") +
    scale_shape_discrete(name = "true cluster") +
    geom_point(data = km$centers, size = 7, color = "black", shape = "X")
```

## Custom cluster ordering
By default, the clusters where ordered using the following function: `hclust(dist(cor(t(centers))))` - hclust of the euclidean distance of the correlation matrix of the centers.

We can supply our own function to order the clusters using `reorder_func` argument. The function would be applied to each center and he clusters would be ordered by the result.

```{r}
km <- TGL_kmeans_tidy(data %>% select(id, starts_with("V")),
    k = 5,
    metric = "euclid",
    verbose = FALSE,
    reorder_func = median
)
km$centers
```

## Missing data
tglkmeans can deal with missing data, as long as at least one dimension is not missing. for example:

```{r}
data$V1[sample(1:nrow(data), round(nrow(data) * 0.2))] <- NA
data
```

```{r}
km <- TGL_kmeans_tidy(data %>% select(id, starts_with("V")),
    k = 5,
    metric = "euclid",
    verbose = FALSE
)
d <- tglkmeans::match_clusters(data, km, 5)
sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust))
```
and plotting the results (without the NA's) we get:

```{r, fig.show='hold'}
d %>% ggplot(aes(x = V1, y = V2, color = factor(new_clust), shape = factor(true_clust))) +
    geom_point() +
    scale_color_discrete(name = "cluster") +
    scale_shape_discrete(name = "true cluster") +
    geom_point(data = km$centers, size = 7, color = "black", shape = "X")
```

## High dimensions
Let's move to higher dimensions (and higher noise):

```{r}
data <- simulate_data(n = 100, sd = 0.3, nclust = 30, dims = 300)
km <- TGL_kmeans_tidy(data %>% select(id, starts_with("V")),
    k = 30,
    metric = "euclid",
    verbose = FALSE,
    id_column = TRUE
)
```

Note that here we supplied `id_column = TRUE` to indicate that the first column is the id column.

```{r}
d <- tglkmeans::match_clusters(data, km, 30)
sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust))
```
## Comparison with R vanilla kmeans 
Let's compare it to R vanilla kmeans:

```{r}
km_standard <- kmeans(data %>% select(starts_with("V")), 30)
km_standard$clust <- tibble(id = 1:nrow(data), clust = km_standard$cluster)

d <- tglkmeans::match_clusters(data, km_standard, 30)
sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust))
```
We can see that kmeans++ clusters significantly better than R vanilla kmeans.

## Random seed
we can set the seed for reproducible results:

```{r}
km1 <- TGL_kmeans_tidy(data %>% select(starts_with("V")),
    k = 30,
    metric = "euclid",
    verbose = FALSE,
    seed = 60427
)
km2 <- TGL_kmeans_tidy(data %>% select(starts_with("V")),
    k = 30,
    metric = "euclid",
    verbose = FALSE,
    seed = 60427
)
all(km1$centers[, -1] == km2$centers[, -1])
```