
Calculate total sum of squares between clusters in R


My objective is to compare the two clustering methods I've used, cluster_method_1 and cluster_method_2, and find out which one has the larger between-cluster sum of squares, in order to identify which one achieves better separation.

I'm basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of clusters 2, 3, 4, and so on.

Example data frame:

df <- structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848, 
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376, 
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849, 
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849, 
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684, 
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684, 
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328, 
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861, 
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L, 
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4", 
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L, 
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4", 
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))



        x1     x2     x3     x4 cluster_method_1 cluster_method_2
     <dbl>  <dbl>  <dbl>  <dbl> <fct>            <fct>           
 1  0.0176  0.536  0.647 -0.727 1                5               
 2 -1.15    0.886 -0.604  0.727 4                3               
 3  1.07   -0.945 -0.604  0.727 4                1               
 4  2.00   -2.24  -0.604 -0.727 4                3               
 5  0.0176 -1.81  -0.604 -0.727 2                4               
 6  0.944   1.83   0.647  0.727 2                2               
 7  2.00    0.886 -0.604  0.727 4                1               
 8  0.0176  0.969 -0.604 -0.727 2                1               
 9 -1.15    0.187 -0.604 -0.727 1                1               
10 -1.15   -0.679  1.62  -0.727 6                6  

Solution

  • The within sum-of-squares for a cluster S_i can be written as the sum of the squared (Euclidean) distances over all ordered pairs of points in that cluster (that is, over the full distance matrix), divided by twice the number of points in the cluster (see e.g. the Wikipedia article on k-means clustering).

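    $$\sum_{x \in S_i} \lVert x - \mu_i \rVert^2 = \frac{1}{2\,|S_i|} \sum_{x, y \in S_i} \lVert x - y \rVert^2$$

    where \mu_i is the mean of the points in S_i and the sum on the right runs over all ordered pairs (x, y).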

    For convenience, we define a function calc_SS that returns the within sum-of-squares for a data.frame whose columns are all numeric:

    calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))
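
As a quick sanity check of this identity, the sketch below compares calc_SS with a direct computation from the centroid. The helper calc_SS_centroid and the toy data frame are made up for illustration and are not part of the original answer.

```r
# calc_SS as defined in the answer: pairwise squared distances / (2 * n)
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Hypothetical reference version: squared deviations from the centroid
calc_SS_centroid <- function(df) {
    m <- as.matrix(df)
    # Subtract each column mean from its column, then sum the squares
    sum(sweep(m, 2, colMeans(m))^2)
}

# Toy data, made up for illustration
toy <- data.frame(x1 = c(0, 2, 4), x2 = c(1, 1, 4))

calc_SS(toy)           # 14, up to floating-point rounding
calc_SS_centroid(toy)  # 14
all.equal(calc_SS(toy), calc_SS_centroid(toy))
```

Both versions agree on this toy example; the pairwise form simply avoids computing the centroid explicitly.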
    

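    It's then straightforward to calculate the within (cluster) sum-of-squares for every cluster of each method. (gather() and spread() are superseded in current tidyr by pivot_longer() and pivot_wider(), but they still work.)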

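    library(tidyverse)
    df %>%
        # Long format: one row per point and clustering method
        gather(method, cluster, cluster_method_1, cluster_method_2) %>%
        group_by(method, cluster) %>%
        # One nested data frame (x1 to x4) per method/cluster combination
        nest() %>%
        mutate(within_SS = map_dbl(data, calc_SS)) %>%
        ungroup() %>%
        select(method, cluster, within_SS) %>%
        spread(method, within_SS)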
    ## A tibble: 6 x 3
    #  cluster cluster_method_1 cluster_method_2
    #  <chr>              <dbl>            <dbl>
    #1 1                   1.52             9.99
    #2 2                  10.3              0
    #3 3                  NA               10.9
    #4 4                  15.2              0
    #5 5                  NA                0
    #6 6                   0                0
    

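    The total within sum-of-squares is then just the sum of the within sum-of-squares for every cluster. Because the total sum of squares of the data does not depend on the clustering, the method with the smaller total within sum-of-squares is also the one with the larger between-cluster sum of squares.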

    df %>%
        gather(method, cluster, cluster_method_1, cluster_method_2) %>%
        group_by(method, cluster) %>%
        nest() %>%
        mutate(within_SS = map_dbl(data, calc_SS)) %>%
        # Add up the per-cluster values within each method
        group_by(method) %>%
        summarise(total_within_SS = sum(within_SS)) %>%
        spread(method, total_within_SS)
    ## A tibble: 1 x 2
    #  cluster_method_1 cluster_method_2
    #             <dbl>            <dbl>
    #1             27.0             20.9 
    

    By the way, we can confirm that calc_SS does indeed return the within sum-of-squares using the iris dataset:

    set.seed(2018)
    df2 <- iris[, 1:4]
    # Fit k-means with 3 clusters on the four numeric columns
    km <- kmeans(as.matrix(df2), 3)
    df2$cluster <- km$cluster
    
    df2 %>%
        group_by(cluster) %>%
        nest() %>%
        mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
        arrange(cluster)
    ## A tibble: 3 x 3
    #  cluster data              within_SS
    #    <int> <list>                <dbl>
    #1       1 <tibble [38 × 4]>      23.9
    #2       2 <tibble [62 × 4]>      39.8
    #3       3 <tibble [50 × 4]>      15.2
    
    # Per-cluster within sum-of-squares as reported by kmeans()
    km$withinss
    #[1] 23.87947 39.82097 15.15100
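
Coming back to the between sum-of-squares asked about in the question: since the total sum of squares equals the within plus the between sum of squares, the between part can be obtained by subtraction. The following is a minimal sketch under that assumption; calc_between_SS is a hypothetical helper introduced here for illustration (it is not part of the original answer), and it is checked against the betweenss component returned by kmeans().

```r
# Within sum-of-squares from pairwise distances, as defined in the answer
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Hypothetical helper: between SS = total SS - total within SS
calc_between_SS <- function(data, cluster) {
    # Treating all points as one cluster gives the total sum of squares
    total_SS <- calc_SS(data)
    # Within SS per cluster; drop = TRUE skips unused factor levels
    within_SS <- sapply(split(data, cluster, drop = TRUE), calc_SS)
    total_SS - sum(within_SS)
}

# Check against kmeans() on the numeric iris columns
set.seed(2018)
num <- iris[, 1:4]
km <- kmeans(num, centers = 3, nstart = 10)
all.equal(calc_between_SS(num, km$cluster), km$betweenss)
```

Assuming the example data from the question are stored in df, calling calc_between_SS(df[, c("x1", "x2", "x3", "x4")], df$cluster_method_1) and the same with df$cluster_method_2 would give the two values to compare; the larger one indicates better separation by this criterion.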