python cluster-analysis data-mining k-means centroid

how to choose initial centroids for k-means clustering

I am implementing k-means clustering in Python. What is a good way to choose the initial centroids for a data set? For instance, I have the following data set:

A,1,1
B,2,1
C,4,4
D,4,5

I need to create two different clusters. How do I choose the starting centroids?

Solution

You might want to look at the k-means++ method. It is one of the most popular ways of choosing initial centroids, it is easy to implement, and it gives consistent results. There is a paper on it. It works as follows:

1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 (you can use scipy.stats.rv_discrete for that).
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
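
The seeding steps above can be sketched in plain Python roughly as follows. This is only a minimal illustration, not a reference implementation: the names kmeans_pp_init and squared_distance are made up for this example, and it uses random.choices from the standard library for the weighted draw instead of scipy.stats.rv_discrete. It covers the seeding only, so the standard k-means iterations still have to run afterwards.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centers from points using k-means++ style seeding.

    Hypothetical helper for illustration; points is a list of tuples.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Step 1: choose the first center uniformly at random.
    centers = [rng.choice(points)]

    # Step 2: D(x)^2 for every point, relative to the nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    # Step 4: repeat until k centers have been chosen.
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point coincides with a chosen center,
            # fall back to a uniform draw so the weights are never all zero.
            new_center = rng.choice(points)
        else:
            # Step 3: draw a point with probability proportional to D(x)^2.
            new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        # Update D(x)^2 now that one more center exists.
        d2 = [min(old, squared_distance(p, new_center))
              for p, old in zip(points, d2)]
    return centers


# The data set from the question: A, B, C, D.
points = [(1, 1), (2, 1), (4, 4), (4, 5)]
centers = kmeans_pp_init(points, k=2, seed=0)
print(centers)
```

For the data set in the question, suppose A = (1, 1) happens to be drawn first. The squared distances D(x)^2 are then 0 for A, 1 for B, 18 for C and 25 for D, so the second center is B with probability 1/44, C with probability 18/44 and D with probability 25/44. The second center is therefore very likely to land in the other group of points, which is the whole idea of the method.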