Calculate total sum of squares between clusters in R

My objective is to determine which of the two clustering methods I've used, cluster_method_1 or cluster_method_2, has the larger between-cluster sum of squares, in order to identify which one achieves better separation.

I'm basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of clusters 2, 3, 4, and so on.

Example data frame:

df <- structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848, 
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376, 
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849, 
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849, 
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684, 
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684, 
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328, 
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861, 
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L, 
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4", 
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L, 
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4", 
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))



        x1     x2     x3     x4 cluster_method_1 cluster_method_2
     <dbl>  <dbl>  <dbl>  <dbl> <fct>            <fct>           
 1  0.0176  0.536  0.647 -0.727 1                5               
 2 -1.15    0.886 -0.604  0.727 4                3               
 3  1.07   -0.945 -0.604  0.727 4                1               
 4  2.00   -2.24  -0.604 -0.727 4                3               
 5  0.0176 -1.81  -0.604 -0.727 2                4               
 6  0.944   1.83   0.647  0.727 2                2               
 7  2.00    0.886 -0.604  0.727 4                1               
 8  0.0176  0.969 -0.604 -0.727 2                1               
 9 -1.15    0.187 -0.604 -0.727 1                1               
10 -1.15   -0.679  1.62  -0.727 6                6

Solution

The within sum-of-squares for cluster S_i can be written as the sum of all squared pairwise (Euclidean) distances between its points, divided by twice the number of points in that cluster (see e.g. the Wikipedia article on k-means clustering).
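
Written out, the identity reads as follows. The notation is an assumption following the usual k-means convention: $\mu_i$ is the centroid of $S_i$, $|S_i|$ its number of points, and the double sum runs over all ordered pairs of points in the cluster.

```latex
\mathrm{SS}_W(S_i)
  \;=\; \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
  \;=\; \frac{1}{2\,|S_i|} \sum_{x \in S_i} \sum_{y \in S_i} \lVert x - y \rVert^2
```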

For convenience we define a function calc_SS that returns the within sum-of-squares for a (numeric) data.frame:

calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))
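
As a quick sanity check, the pairwise-distance form can be compared with the direct definition (squared deviations from the centroid). This is only a sketch: the data frame toy and the name centroid_SS are made up for illustration.

```r
# Within sum-of-squares from pairwise distances (as defined above)
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Hypothetical toy data: 5 points in 2 dimensions
toy <- data.frame(a = c(1, 2, 4, 7, 11), b = c(3, 1, 4, 1, 5))

# Direct definition: squared deviations from the column means (the centroid)
centroid_SS <- sum(scale(toy, center = TRUE, scale = FALSE)^2)

# The two forms should agree up to floating-point tolerance
all.equal(calc_SS(toy), centroid_SS)
```

Note that as.matrix(dist(df)) is the full symmetric matrix, so every pair is counted twice; that is why the divisor is 2 * nrow(df) rather than nrow(df).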

It's then straightforward to calculate the within (cluster) sum-of-squares for each cluster under each method:

library(tidyverse)
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    spread(method, within_SS)
## A tibble: 6 x 3
#  cluster cluster_method_1 cluster_method_2
#  <chr>              <dbl>            <dbl>
#1 1                   1.52             9.99
#2 2                  10.3              0
#3 3                  NA               10.9
#4 4                  15.2              0
#5 5                  NA                0
#6 6                   0                0

An NA means that the method has no cluster with that label (cluster_method_1 only uses the labels 1, 2, 4 and 6), and a value of 0 means that the cluster contains a single point.

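The total within sum-of-squares is then just the sum of the within sum-of-squares over all clusters. Because the total sum of squares of the data is the same for both methods and equals the within plus the between sum of squares, the method with the smaller total within sum-of-squares is the one with the larger between sum-of-squares.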

df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    group_by(method) %>%
    summarise(total_within_SS = sum(within_SS)) %>%
    spread(method, total_within_SS)
## A tibble: 1 x 2
#  cluster_method_1 cluster_method_2
#             <dbl>            <dbl>
#1             27.0             20.9
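
To get the between sum-of-squares itself, one option is to subtract the total within sum-of-squares from the total sum of squares of the data (all points treated as a single cluster). The following is a minimal base-R sketch. The helper name calc_between_SS is hypothetical, and the check uses the iris data with kmeans() rather than the data from the question.

```r
# Within sum-of-squares from pairwise distances (as defined above)
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Hypothetical helper: between SS = total SS - total within SS
calc_between_SS <- function(X, cluster) {
    X <- as.data.frame(X)
    # Total SS: treat all points as one cluster
    total_SS <- calc_SS(X)
    # One within SS per cluster; drop = TRUE skips unused factor levels
    within_SS <- sapply(split(X, cluster, drop = TRUE), calc_SS)
    total_SS - sum(within_SS)
}

# Check against the value that kmeans() reports for the iris data
set.seed(2018)
X <- iris[, 1:4]
km <- kmeans(as.matrix(X), 3)
all.equal(calc_between_SS(X, km$cluster), km$betweenss)
```

With the data frame from the question this would be called as calc_between_SS(df[, c("x1", "x2", "x3", "x4")], df$cluster_method_1), and likewise for cluster_method_2; the method with the larger value separates its clusters better by this criterion.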

By the way, we can confirm that calc_SS does indeed return the within sum-of-squares using the iris dataset:

set.seed(2018)
df2 <- iris[, 1:4]
kmeans <- kmeans(as.matrix(df2), 3)
df2$cluster <- kmeans$cluster

df2 %>%
    group_by(cluster) %>%
    nest() %>%
    mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    arrange(cluster)
## A tibble: 3 x 3
#  cluster data              within_SS
#    <int> <list>                <dbl>
#1       1 <tibble [38 × 4]>      23.9
#2       2 <tibble [62 × 4]>      39.8
#3       3 <tibble [50 × 4]>      15.2

kmeans$withinss
#[1] 23.87947 39.82097 15.15100