algorithm cluster-analysis k-means tf-idf unsupervised-learning

K-Means VS K-Modes? (text clustering)


I understand that K-Means can be used to cluster documents by vectorizing them and computing their TF-IDF values. When and how do we decide which one (K-Means or K-Modes) might yield better results, apart from the categorical vs. continuous variable distinction? Does one really give better results, or is it decided on a case-by-case basis?

I have carried out K-Means clustering using TF-IDF and it seems to give decent results, but I can't find any material comparing the two that would justify venturing into K-Modes. There is also a lot on the internet about K-Means + TF-IDF for text clustering, but not much on K-Modes. Any help is appreciated!


Solution

  • K-modes is really only applicable to categorical data, not to sparse numerical data like bag-of-words or tf-idf vectors.

    Consider the mode: on sparse vectors, wouldn't it usually be the all-zeros vector? Then all your cluster centers collapse to zero and carry no information.

    In my experience, k-means on text also works very badly except on toy data, because it can't handle outliers and text data is full of outlier documents.
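
To make the point about the mode concrete, here is a minimal sketch using only the Python standard library. The toy corpus and the helper names (`column_mode`, `column_mean`) are made up for illustration, and plain word counts stand in for tf-idf weights. It builds bag-of-words vectors and compares the per-feature mode with the per-feature mean.

```python
from collections import Counter

# Hypothetical toy corpus, chosen only for illustration.
docs = [
    "cats chase mice",
    "dogs chase cats",
    "stocks fell sharply",
    "markets and stocks rallied",
    "rain expected tomorrow",
    "heavy rain flooded roads",
]

# Bag-of-words: one count per vocabulary word, per document.
vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]


def column_mode(rows):
    """Most frequent value in each column (what a k-modes center stores)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]


def column_mean(rows):
    """Average value in each column (what a k-means center stores)."""
    return [sum(col) / len(col) for col in zip(*rows)]


mode_all = column_mode(vectors)
mean_all = column_mean(vectors)

print("mode:", mode_all)
print("mean:", [round(m, 2) for m in mean_all])
```

Every word here occurs in at most two of the six documents, so zero is the most frequent value in each column and the mode is the all-zeros vector, while the mean still keeps a small nonzero weight for each word. Real tf-idf matrices are typically far sparser than this, so the same effect should be expected there.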