
How to measure performance of K-Means cluster in R? [image & code included]


I am currently running a K-means cluster analysis on customer data at my company. I want to measure the performance of this clustering, but I don't know which packages are used to measure it, and I am also unsure whether my clusters are grouped too close together.

The data feeding my clustering is a simple RFM table (recency, frequency, and monetary value). I also included average order value per transaction by customer. I used the elbow method to determine the optimal number of clusters. The data consists of 1400 customers and 4 metric values.

Attached is also an image of the cluster plot & R Code

Here is my clustering code:

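library(dplyr)       # glimpse()
library(factoextra)  # fviz_cluster()

drop = c('CUST_Business_NM')

#Cleaning & Scaling the Data
new_cluster_data = na.omit(data)
# Drop the name column from the NA-free data (not from the raw data)
new_cluster_data = new_cluster_data[, !(names(new_cluster_data) %in% drop)]
new_cluster_data = scale(new_cluster_data)
glimpse(new_cluster_data)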

#Elbow Method for Optimal Clusters
k.max <- 15
data <- new_cluster_data
wss <- sapply(1:k.max, 
              function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})
#Plot out the Elbow
wss
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

#Create the Cluster
kmeans_test = kmeans(new_cluster_data, centers = 8, nstart = 1000)
View(kmeans_test$cluster)

#Visualize the Cluster
fviz_cluster(kmeans_test, data = new_cluster_data,  show.clust.cent = TRUE, geom = c("point", "text"))

Solution

  • You probably do not want to measure the performance of a single cluster but of the clustering algorithm, in this case kmeans.

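    First, be clear about which distance measure you want to use. Clustering groups observations by how dissimilar they are, so the choice of distance measure is critical. Note that kmeans itself always minimizes squared Euclidean distance to the cluster centers; other measures (manhattan, correlation-based, and so on) are useful for inspecting the data or for algorithms that accept a dissimilarity matrix. You can compute and visualize them, e.g., like this: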

    library("factoextra")
    dis_pearson <- get_dist(yourdataset, method = "pearson")
    dis_pearson
    fviz_dist(dis_pearson)
    

    This will give you the distance matrix and visualize it.
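
    For comparison, here is a minimal sketch of the same idea using only base R. It assumes the built-in USArrests data as a stand-in for yourdataset, and it treats the correlation-based dissimilarity as 1 minus the Pearson correlation between rows, which is one common convention rather than the only possible definition:

    ```r
    # Stand-in data (an assumption; replace with your own scaled data)
    x <- scale(USArrests)

    # Euclidean and Manhattan distances between rows (observations)
    dis_euclid <- dist(x, method = "euclidean")
    dis_manhat <- dist(x, method = "manhattan")

    # Correlation-based dissimilarity: 1 - Pearson correlation between rows
    dis_cor <- as.dist(1 - cor(t(x), method = "pearson"))

    # Inspect the first few pairwise values of two of the measures
    round(as.matrix(dis_euclid)[1:3, 1:3], 2)
    round(as.matrix(dis_cor)[1:3, 1:3], 2)
    ```

    Comparing the matrices side by side shows how much the notion of "close" depends on the measure you pick.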

    The output of kmeans has several bits of information. The most important with regard to your question are:

    • totss: the total sum of squares
    • withinss: vector of within-cluster sum of squares
    • tot.withinss: total within-cluster sum of squares
    • betweenss: the between-cluster sum of squares

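    Thus, for a fixed number of clusters, the goal is a small tot.withinss (tight clusters) relative to totss, which is the same as a large betweenss (well-separated clusters). kmeans ships with base R (the stats package), so no extra package is needed: fit mycluster <- kmeans(yourdataframe, centers = 2), then print mycluster to see all of these measures, or extract a single one with, e.g., mycluster$tot.withinss.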
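
    As a rough illustration of how these components fit together, the sketch below fits kmeans on the built-in USArrests data (a stand-in for your own data; the choice of 4 centers and the seed are arbitrary assumptions) and computes the share of the total sum of squares that lies between clusters:

    ```r
    set.seed(42)                      # kmeans uses random starts
    x  <- scale(USArrests)            # stand-in data, scaled like in the question
    km <- kmeans(x, centers = 4, nstart = 25)

    km$totss          # total sum of squares
    km$withinss       # one within-cluster sum of squares per cluster
    km$tot.withinss   # sum of km$withinss
    km$betweenss      # totss - tot.withinss

    # Share of total variation explained by the clustering (between 0 and 1)
    explained <- km$betweenss / km$totss
    explained
    ```

    A higher ratio means tighter, better-separated clusters, but it always increases with the number of clusters, so it is best used to compare solutions with the same k.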

    Side comment: kmeans requires the user to choose the number of clusters up front (additional effort), and it is very sensitive to outliers.
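
    One way to sanity-check the choice of k and how well the clusters are separated is the average silhouette width, which is not mentioned above but is a common complement to the elbow plot. The sketch below is an assumption-laden example: it uses silhouette() from the cluster package, the built-in USArrests data as a stand-in, and an arbitrary range of k from 2 to 8:

    ```r
    library(cluster)  # provides silhouette()

    set.seed(42)
    x <- scale(USArrests)   # stand-in data; replace with your scaled data
    d <- dist(x)            # Euclidean distances, matching what kmeans optimizes

    # Average silhouette width for each candidate number of clusters
    ks <- 2:8
    avg_sil <- sapply(ks, function(k) {
      km <- kmeans(x, centers = k, nstart = 25)
      mean(silhouette(km$cluster, d)[, "sil_width"])
    })
    names(avg_sil) <- ks
    avg_sil

    # Candidate k with the highest average silhouette width
    best_k <- as.integer(names(which.max(avg_sil)))
    best_k
    ```

    Silhouette widths range from -1 to 1; values near 0 suggest overlapping clusters, which speaks directly to the worry about clusters being grouped too close together.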