I am using k-modes clustering (from the klaR package) to cluster categorical data, but when I cluster the same data with the same number of clusters, it returns different cluster sizes every time.
I was expecting the cluster sizes to always be the same when running it on the same data with the same number of clusters.
Am I doing something wrong?
library(klaR)
mysample <- read.csv("sample_to_cluster.csv")
results1 <- kmodes(mysample[, 2:ncol(mysample)], 3, iter.max = 50, weighted = FALSE)
results2 <- kmodes(mysample[, 2:ncol(mysample)], 3, iter.max = 50, weighted = FALSE)
print(results1$size)
print(results2$size)
# why don't results1 & results2 have the same sizes?
This is the CSV file I am using: CSV
See https://stats.stackexchange.com/questions/58238/how-random-are-the-results-of-the-kmeans-algorithm
There is more than one k-means algorithm.
You are probably thinking of Lloyd's algorithm, which depends only on the initial cluster centers. But there is also MacQueen's, which depends on the sequence, i.e. the ordering, of the points. Then there are the Hartigan-Wong and Forgy variants, and
various implementations may differ in implementation details and optimizations. They may also treat ties differently! For example, many naive implementations always assign tied elements to the first or last cluster.
Furthermore, some implementations reorder the clusters (e.g. by memory address) after k-means finishes, so you cannot safely assume that cluster 1 remains cluster 1, even if k-means converged after the first iteration. Others reorder clusters by cluster size, which actually makes sense for k-means, as it makes different random initializations more likely to return the same labeling.
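For what it's worth, base R's kmeans() (in the stats package) lets you pick the variant explicitly through its algorithm argument, so you can see this effect directly. A minimal sketch on made-up numeric data (the data and seed values here are just placeholders):

# Base R's kmeans() exposes several of these variants via 'algorithm'.
# Toy numeric data, purely for illustration.
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)

set.seed(1)
fit_lloyd <- kmeans(x, centers = 3, iter.max = 50, algorithm = "Lloyd")
set.seed(1)
fit_macqueen <- kmeans(x, centers = 3, iter.max = 50, algorithm = "MacQueen")

print(fit_lloyd$size)
print(fit_macqueen$size)  # cluster sizes (and labels) can differ between variants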
How much this matters really depends on what kind of data you have. If it splits nicely into spherical clusters, you will typically get very similar clusters each run. If not, the clusterings can look fairly random from run to run.
set.seed(1)
Every time k-means initializes the centroids, they are chosen randomly, so you need to set a seed for the random number generator to get reproducible results.
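The same applies to kmodes, whose initial modes are sampled randomly. Applied to the code from the question (a sketch, assuming the same CSV file), setting the seed immediately before each call should make the two runs agree:

library(klaR)
mysample <- read.csv("sample_to_cluster.csv")

# Fixing the seed right before each call makes the random choice of initial
# modes identical, so both runs should produce the same clustering.
set.seed(1)
results1 <- kmodes(mysample[, 2:ncol(mysample)], 3, iter.max = 50, weighted = FALSE)
set.seed(1)
results2 <- kmodes(mysample[, 2:ncol(mysample)], 3, iter.max = 50, weighted = FALSE)

identical(results1$size, results2$size)  # should now be TRUE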