cluster-analysis, weka

Clustering texts in weka


I am trying to cluster text data (forum post contents plus users, 817 instances) in Weka using SimpleKMeans. For some reason the clustering comes out like this:

Clustered Instances

0      812 ( 99%)
1        1 (  0%)
2        1 (  0%)
3        1 (  0%)
4        1 (  0%)
5        1 (  0%)

Could someone explain to me why I'm not getting evenly sized clusters?


Solution

  • K-means doesn't guarantee evenly sized clusters. (There is a tutorial on how to modify k-means to produce equal-sized clusters, but that won't solve your problem.)

    K-means is quite sensitive to outliers. In their presence, it's fairly common to see "outlier clusters" that consist of a single point only, which is probably what you are observing.

    But more than that, k-means also doesn't work well with high-dimensional discrete data, and your text data is most likely exactly that: high-dimensional and discrete-valued. The problem is that on such data, every point is more or less unique, i.e. an outlier. No two forum posts (except maybe spam) are the same. And worse: they are all at more or less the same distance from each other with respect to squared Euclidean distance (which is the distance k-means minimizes).

    You are using k-means for a scenario it wasn't designed for, so it's not surprising that it doesn't work well. It's meant for quantization of low-dimensional continuous data, not for extracting meaningful groups out of text.
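
    To make the point about squared Euclidean distance concrete, here is a minimal sketch in plain Python (not Weka code). It assumes each post is a binary bag of words, generated at random; the vocabulary sizes, the post length, and the function names are made up for illustration. With a large vocabulary, random posts share almost no words, so nearly every pair ends up at almost the same distance, which leaves k-means little structure to work with.

```python
import random
from itertools import combinations


def random_post(vocab_size, words_per_post, rng):
    """Synthetic 'post': a set of distinct word indices (binary bag of words)."""
    return frozenset(rng.sample(range(vocab_size), words_per_post))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two binary vectors stored as sets.

    Every word that occurs in exactly one of the two posts contributes 1,
    so the distance is the size of the symmetric difference.
    """
    return len(a ^ b)


def distance_spread(vocab_size, words_per_post, n_posts=200, seed=0):
    """Return (smallest, largest) pairwise squared distance over random posts."""
    rng = random.Random(seed)
    posts = [random_post(vocab_size, words_per_post, rng) for _ in range(n_posts)]
    dists = [squared_euclidean(a, b) for a, b in combinations(posts, 2)]
    return min(dists), max(dists)


if __name__ == "__main__":
    # Assumed setting resembling text: huge vocabulary, few words per post.
    lo, hi = distance_spread(vocab_size=5000, words_per_post=20)
    print("sparse, high-dimensional: min", lo, "max", hi, "ratio", lo / hi)

    # Assumed contrast: tiny vocabulary, so posts overlap to varying degrees.
    lo, hi = distance_spread(vocab_size=40, words_per_post=20)
    print("dense, low-dimensional:   min", lo, "max", hi, "ratio", lo / hi)
```

    A min/max ratio close to 1 means the nearest and the farthest post are almost equally far away. Real forum posts are not random, so the effect will be less extreme than in this toy setup, but the tendency is the same.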