My objective is to compare which of the two clustering methods I've used, cluster_method_1 and cluster_method_2, has the largest between-cluster sum of squares, in order to identify which one achieves better separation.
I'm basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of clusters 2, 3, 4, and so on.
Example data frame:
df <- structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848,
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376,
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849,
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849,
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684,
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684,
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328,
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861,
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L,
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4",
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L,
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4",
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
x1 x2 x3 x4 cluster_method_1 cluster_method_2
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 0.0176 0.536 0.647 -0.727 1 5
2 -1.15 0.886 -0.604 0.727 4 3
3 1.07 -0.945 -0.604 0.727 4 1
4 2.00 -2.24 -0.604 -0.727 4 3
5 0.0176 -1.81 -0.604 -0.727 2 4
6 0.944 1.83 0.647 0.727 2 2
7 2.00 0.886 -0.604 0.727 4 1
8 0.0176 0.969 -0.604 -0.727 2 1
9 -1.15 0.187 -0.604 -0.727 1 1
10 -1.15 -0.679 1.62 -0.727 6 6
The within sum-of-squares for a cluster S_i can be written as the sum of all squared pairwise (Euclidean) distances between its points, divided by twice the number of points in that cluster (see e.g. the Wikipedia article on k-means clustering).
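Written out, the identity reads as follows. The notation is an assumption made here for illustration and does not appear in the post: $\mu_i$ is the centroid of $S_i$, $|S_i|$ is its size, and the double sum runs over all ordered pairs of points, so each unordered pair is counted twice.

```latex
W(S_i) \;=\; \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
       \;=\; \frac{1}{2\,|S_i|} \sum_{x \in S_i} \sum_{y \in S_i} \lVert x - y \rVert^2
```

Summing every entry of a full squared distance matrix corresponds to the double sum on the right.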
For convenience we define a function calc_SS that returns the within sum-of-squares for a (numeric) data.frame:
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))
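As a quick sanity check, the pairwise-distance form can be compared with the more familiar "squared deviations from the centroid" form. This is only a sketch: calc_SS_centroid and the toy data frame are hypothetical names introduced here, not part of the original answer.

```r
# Pairwise-distance form, as defined in the answer
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Hypothetical helper: sum of squared deviations from the column means (centroid)
calc_SS_centroid <- function(df) {
  m <- as.matrix(df)
  sum(sweep(m, 2, colMeans(m))^2)
}

# Small made-up data set, used purely for illustration
set.seed(1)
toy <- data.frame(a = rnorm(5), b = rnorm(5))

# Both forms should agree up to floating-point error
all.equal(calc_SS(toy), calc_SS_centroid(toy))
```

The pairwise form needs an n x n distance matrix, so for large clusters the centroid form is likely the cheaper of the two.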
It's then straightforward to calculate the within-cluster sum-of-squares for every cluster under each method:
library(tidyverse)
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    spread(method, within_SS)
## A tibble: 6 x 3
# cluster cluster_method_1 cluster_method_2
# <chr> <dbl> <dbl>
#1 1 1.52 9.99
#2 2 10.3 0
#3 3 NA 10.9
#4 4 15.2 0
#5 5 NA 0
#6 6 0 0
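The total within sum-of-squares is then just the sum of the within sum-of-squares over all clusters. Because the total sum of squares of the data is the same for both methods and equals the within plus the between sum-of-squares, the method with the smaller total within sum-of-squares has the larger between sum-of-squares: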
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    group_by(method) %>%
    summarise(total_within_SS = sum(within_SS)) %>%
    spread(method, total_within_SS)
## A tibble: 1 x 2
# cluster_method_1 cluster_method_2
# <dbl> <dbl>
#1 27.0 20.9
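To get the between sum-of-squares explicitly, one option is to subtract the total within sum-of-squares from the total sum-of-squares of the whole data set, which calc_SS also returns when applied to all rows at once. The following base R sketch rebuilds the example data so that it runs on its own; the names features, total_SS and between_SS are introduced here for illustration.

```r
# Example data from the question, rebuilt as a plain data.frame
df <- data.frame(
  x1 = c(0.01762376, -1.147739752, 1.073605848, 2.000420899, 0.01762376,
         0.944438811, 2.000420899, 0.01762376, -1.147739752, -1.147739752),
  x2 = c(0.536193126, 0.885609849, -0.944699546, -2.242627057, -1.809984553,
         1.834120637, 0.885609849, 0.96883563, 0.186776403, -0.678508604),
  x3 = c(0.64707104, -0.603759684, -0.603759684, -0.603759684, -0.603759684,
         0.64707104, -0.603759684, -0.603759684, -0.603759684, 1.617857394),
  x4 = c(-0.72712328, 0.72730861, 0.72730861, -0.72712328, -0.72712328,
         0.72730861, 0.72730861, -0.72712328, -0.72712328, -0.72712328),
  cluster_method_1 = factor(c(1, 4, 4, 4, 2, 2, 4, 2, 1, 6)),
  cluster_method_2 = factor(c(5, 3, 1, 3, 4, 2, 1, 1, 1, 6))
)

calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Total sum-of-squares: treat the whole data set as a single cluster
features <- df[, c("x1", "x2", "x3", "x4")]
total_SS <- calc_SS(features)

# Between SS per method = total SS - sum of within SS over that method's clusters
between_SS <- sapply(c("cluster_method_1", "cluster_method_2"), function(method) {
  within_SS <- sapply(split(features, df[[method]]), calc_SS)
  total_SS - sum(within_SS)
})
between_SS
```

Under this decomposition, cluster_method_2 has the smaller total within sum-of-squares in the output above and therefore the larger between sum-of-squares.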
By the way, we can confirm that calc_SS does indeed return the within sum-of-squares using the iris dataset:
set.seed(2018)
df2 <- iris[, 1:4]
# Store the fit as `km` so that it does not shadow the kmeans() function
km <- kmeans(as.matrix(df2), 3)
df2$cluster <- km$cluster
df2 %>%
    group_by(cluster) %>%
    nest() %>%
    mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    arrange(cluster)
## A tibble: 3 x 3
# cluster data within_SS
# <int> <list> <dbl>
#1 1 <tibble [38 × 4]> 23.9
#2 2 <tibble [62 × 4]> 39.8
#3 3 <tibble [50 × 4]> 15.2
# Use the full component name rather than relying on partial matching of `$within`
km$withinss
#[1] 23.87947 39.82097 15.15100
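The same kmeans fit can also be used to check the decomposition into within and between sum-of-squares, since the object returned by kmeans() carries the components totss, tot.withinss and betweenss. A minimal sketch, with features and km as illustrative variable names:

```r
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

set.seed(2018)
features <- iris[, 1:4]
km <- kmeans(as.matrix(features), 3)

# calc_SS applied to all rows should reproduce the total sum-of-squares
all.equal(calc_SS(features), km$totss)

# Between SS is what remains after removing the total within SS
all.equal(km$betweenss, km$totss - km$tot.withinss)
```

Both comparisons are expected to hold up to floating-point tolerance, whichever cluster labels a given R version assigns for this seed.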