python cluster-analysis data-mining k-means centroid

How to choose initial centroids for k-means clustering


I am working on implementing k-means clustering in Python. What is a good way to choose initial centroids for a data set? For instance, I have the following data set:

A,1,1
B,2,1
C,4,4
D,4,5

I need to create two different clusters. How do I choose the starting centroids?


Solution

  • You might want to look at the k-means++ method. It is one of the most popular ways of choosing initial centroids, it is easy to implement, and it gives more consistent results than picking all centers uniformly at random. There is a paper on it. It works as follows:

    1. Choose one center uniformly at random from among the data points.
    2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
    3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 (you can use scipy.stats.rv_discrete for that).
    4. Repeat Steps 2 and 3 until k centers have been chosen.
    5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
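
The seeding steps above can be sketched in plain Python roughly as follows. This is only a minimal illustration, not a tuned implementation. The function name `kmeans_pp_init` and its `seed` parameter are made up for this example, and the weighted draw uses `random.choices` from the standard library instead of `scipy.stats.rv_discrete`. It assumes the data contains at least k distinct points (otherwise all weights become zero and the draw fails).

```python
import random

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centers from `points` using k-means++ style seeding.

    Hypothetical helper for illustration; `points` is a list of numeric tuples.
    """
    rng = random.Random(seed)

    def sq_dist(p, q):
        # Squared Euclidean distance between two points.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Step 1: choose the first center uniformly at random.
    centers = [rng.choice(points)]

    # Step 2: D(x)^2 for every point, relative to the nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # Step 3: draw a point with probability proportional to D(x)^2.
        # Points already chosen have weight 0, so they are not drawn again.
        new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)

        # Step 2 again: keep the smaller of the old and new squared distance.
        d2 = [min(old, sq_dist(p, new_center)) for old, p in zip(d2, points)]

    return centers

# The data set from the question: A, B, C, D.
data = [(1, 1), (2, 1), (4, 4), (4, 5)]
init = kmeans_pp_init(data, k=2, seed=0)
print(init)
```

The returned centers can then be passed to whatever standard k-means loop you are implementing. Because far-away points get larger weights, the second center is more likely (but not guaranteed) to land in the other group of points.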