java cluster-analysis weka data-mining k-means

Significance of "seed" in Weka k-means clustering

The Weka SimpleKMeans implementation allows the user to specify a "seed value" with the option -s. I do not understand what it signifies. In this link, Mark Hall, the Weka architect, says that it is used to generate random numbers.

The Weka implementation is supposed to follow the paper on k-means++ (as mentioned in the documentation). If I have understood it correctly, the cluster centroids are chosen using equation 1b, section 2.2, page 3 of that paper, and there is no other source of randomness.

Can anyone please point out what I am getting wrong?

Solution

It is common best practice with k-means algorithms (note: there is more than one algorithm for k-means; they are heuristics, as finding the optimal solution is reported to be NP-hard) to do multiple runs with different random initial centers and keep the best result.
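To make the "multiple runs" idea concrete, here is a minimal sketch in plain Java. It is not Weka's or ELKI's code: the class `KMeansRestarts` and its methods `runOnce` and `bestSeed` are hypothetical names, the initial centers are simply k random objects, and the iteration is a basic Lloyd-style loop. The only source of randomness is the `Random` created from the seed.

```java
import java.util.Random;

// Hypothetical helper class for illustration; not part of Weka or ELKI.
class KMeansRestarts {

    // Squared Euclidean distance between two points of equal dimension.
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // One Lloyd-style k-means run (requires k <= data.length).
    // Returns the sum of squared distances of all points to their nearest center.
    static double runOnce(double[][] data, int k, long seed, int maxIter) {
        Random rnd = new Random(seed); // the seed fixes the initial centers
        int n = data.length, dim = data[0].length;

        // Initial centers: k distinct random objects (partial Fisher-Yates shuffle).
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + rnd.nextInt(n - c);
            int t = idx[c]; idx[c] = idx[j]; idx[j] = t;
            centers[c] = data[idx[c]].clone();
        }

        int[] assign = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach each point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (sqDist(data[i], centers[c]) < sqDist(data[i], centers[best])) best = c;
                }
                if (iter == 0 || assign[i] != best) changed = true;
                assign[i] = best;
            }
            if (!changed) break; // converged

            // Update step: move each center to the mean of its points.
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[assign[i]]++;
                for (int d = 0; d < dim; d++) sum[assign[i]][d] += data[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (count[c] == 0) continue; // keep an empty cluster's center as is
                for (int d = 0; d < dim; d++) centers[c][d] = sum[c][d] / count[c];
            }
        }

        double sse = 0;
        for (double[] p : data) {
            double m = Double.POSITIVE_INFINITY;
            for (double[] c : centers) m = Math.min(m, sqDist(p, c));
            sse += m;
        }
        return sse;
    }

    // Tries every seed and returns the one with the lowest sum of squares.
    static long bestSeed(double[][] data, int k, long[] seeds, int maxIter) {
        long best = seeds[0];
        double bestSse = Double.POSITIVE_INFINITY;
        for (long s : seeds) {
            double sse = runOnce(data, k, s, maxIter);
            if (sse < bestSse) { bestSse = sse; best = s; }
        }
        return best;
    }
}
```

Calling `runOnce` twice with the same seed gives the same result, which is the whole point of exposing a seed; calling `bestSeed` with several seeds is the "try multiple runs and keep the best" practice.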

So the randomness usually comes from choosing the initial centers. K-means++ is an alternative way of choosing the initial centers that tries to pick a better starting situation. Fortunately, it is still randomized. Some initialization methods are not, and with those you can no longer try to improve your results with multiple runs.

Why are you looking for a source of randomness other than the initial means?
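As a rough illustration of where the seed enters k-means++, here is a sketch of the seeding step as I understand it from the paper: the first center is drawn uniformly, and each further center is drawn with probability proportional to the squared distance to the nearest center chosen so far. This is not Weka's actual code; the class `KMeansPlusPlusSeeding` and the method `chooseCenters` are hypothetical names.

```java
import java.util.Random;

// Hypothetical helper class for illustration; not Weka's implementation.
class KMeansPlusPlusSeeding {

    // Squared Euclidean distance between two points of equal dimension.
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Returns the indices of k data points to use as initial centers.
    // All random draws come from a generator created from the given seed.
    static int[] chooseCenters(double[][] data, int k, long seed) {
        Random rnd = new Random(seed);
        int n = data.length;
        int[] chosen = new int[k];

        // First center: uniformly at random.
        chosen[0] = rnd.nextInt(n);

        // minDist[i] = squared distance from point i to its nearest chosen center.
        double[] minDist = new double[n];
        for (int i = 0; i < n; i++) minDist[i] = sqDist(data[i], data[chosen[0]]);

        for (int c = 1; c < k; c++) {
            double total = 0;
            for (double d : minDist) total += d;

            // Roulette-wheel draw: point i is picked with probability minDist[i] / total.
            double r = rnd.nextDouble() * total;
            int pick = n - 1; // fallback if total is 0 or rounding leaves r == total
            double acc = 0;
            for (int i = 0; i < n; i++) {
                acc += minDist[i];
                if (r < acc) { pick = i; break; }
            }
            chosen[c] = pick;

            // The new center may now be the nearest one for some points.
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], sqDist(data[i], data[pick]));
            }
        }
        return chosen;
    }
}
```

Points far from every existing center are more likely to be picked, but the choice is still a random draw, so different seeds can give different starting centers while the same seed always gives the same ones.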

I don't recommend Weka for clustering. It is okay for classification, but it has quite limited support for clustering and other unsupervised methods. Instead, have a look at ELKI. Their k-means package, for example, is quite exhaustive. They have ~6 different methods for choosing the initial means, and most are randomized. The simplest and most common initialization is probably to just start with k random objects from the database. IIRC, MacQueen used the first k objects, so that variant is not randomized (unless you shuffle your data set first, which is actually a good idea for quite a few algorithms: never use sorted data!). Most of these initializers therefore come with a parameter -kmeans.seed which, guess what, allows you to control the seeding of the random generator, for reproducible results.