Search code examples
machine-learningcluster-analysisk-means

K-means++ clustering Algorithm


The algorithm for the K-means++ is:

  1. Take one centroid c(i), chosen uniformly at random from the dataset.
  2. Take a new Centroid c(i), choosing an instance x(i) from the dataset with the probability D(X(i))^2/Sum(D(X(j))^2) from j=1 to m, where D(X(i)) is the distance between the instance and the closest centroid which is selected.

What is this parameter m used in the summation of the probability?


Solution

  • It might have been helpful to see the original formulation, but the algorithm is quite clear: during the initialization phase, for each point not used as a centroid, calculate the distance between said point and the nearest centroid, that will be the distance D(X[i]), the pick a random point in this set of points with probability weighted with D(X[i])^2

    In your formulation it seems you got m points unused.