Tags: r, string, cluster-analysis

How to do k-means clustering on a dataset full of string variables in R


Right now I have a dataset that is full of string variables, and I want to do a clustering project on it. Even after I apply as.factor() to all the variables, NbClust() still does not work. What am I supposed to do?


Solution

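  • K-means typically uses Euclidean distances (see e.g. https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric), so you can't directly "cluster on words". Converting the strings with as.factor() does not change this, because factor levels are category labels rather than meaningful numeric coordinates.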

    If you want to cluster observations based on words, you have to generate numbers from them first (see e.g. "k-means for text clustering"). For example, if you were clustering customer profiles for segmentation, you could create one column per interest, count the number of times that word or n-gram appears in each profile, and then cluster on the resulting matrix of numbers. When clustering documents, generate a term-document matrix (or a document-term matrix, or a term-term co-occurrence matrix, as in "k-means clustering on term-term co-occurrence matrix") and use those numbers for clustering.
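
As a rough illustration of the counting idea, here is a minimal sketch in base R. The `profiles` vector, its names, and the choice of two clusters are made up for the example and are not from the question; real text would usually need more careful tokenizing (for instance with a text-mining package).

```r
# Hypothetical toy data: one free-text "interests" string per customer.
profiles <- c(
  alice = "hiking camping hiking climbing",
  bob   = "camping hiking fishing",
  carol = "chess poker chess boardgames",
  dave  = "poker chess boardgames poker"
)

# Split each profile into lowercase words.
tokens <- strsplit(tolower(profiles), "[^a-z]+")

# Vocabulary: every distinct word becomes one column.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word appears in each profile.
# vapply returns one column per profile, so transpose to get one row per profile.
counts <- t(vapply(
  tokens,
  function(words) as.numeric(tabulate(match(words, vocab), nbins = length(vocab))),
  numeric(length(vocab))
))
dimnames(counts) <- list(names(profiles), vocab)

# counts is a numeric matrix, so Euclidean distance is well defined on it.
set.seed(42)  # k-means starts from randomly chosen centers
fit <- kmeans(counts, centers = 2, nstart = 10)

print(counts)
print(fit$cluster)
```

The matrix `counts` plays the role of a small document-term matrix: rows are observations, columns are words. The same numeric matrix is the kind of input that cluster-number helpers expect, in place of factor columns.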