Search code examples
algorithmmathk-means

Initial centroids in k-means


So I found a description online that says:

Start with the center of all points. Choose successively the point that is the furthest away from all centers as a center for the next cluster.

So from this I take that:

center = avg of all points

centroid1 = point furthest away from center

centroid2 = point furthest away from center AND cencroid1

centroid3 = point furthest away from center AND cencroid1 AND centroid2.

My problem is, how am I supposed to calculate for example the furthest point from center and centroid1? Do I average them and then choose the furthest point from the middle? Do I calculate the max distance point from both center and centroid1 and choose the further one? If so, wouldn't centroid3 become equal to centroid1 or 2?


Solution

  • In this document Centroids Initialization for K-Means Clustering using Improved Pillar Algorithm furthest means sum. So, on the second step you need to sum distance from the first centroid and distance form the average of all points for every point and then choose the biggest one.

    Relevant lines in provided pseudo-code are

    2. Calculate D <- dis(X, m)
    ...
    6. Set i = 1 as counter to determine the i-th initial centroid
    7. DM = DM + D
    8. Select x <- xargmax(DM) as the candidate for i-th initial centroids
    

    To select a next x for the candidate of the rest initial centroids, Di (where i is the current iteration step) is recalculated between each data points and ci-1 . The Di is then added to the accumulated distance metric DM (DM <- DM + Di).