Search code examples
k-means

k-means empty cluster


I try to implement k-means as a homework assignment. My exercise sheet gives me following remark regarding empty centers:

During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.

That confuses me a bit, firstly Wikipedia or other sources I read do not mention that at all. I further read about a problem with 'choosing a good k for your data' - how is my algorithm supposed to converge if I start setting new centers for cluster that were empty.

If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?


Solution

  • Handling empty clusters is not part of the k-means algorithm but might result in better clusters quality. Talking about convergence, it is never exactly but only heuristically guaranteed and hence the criterion for convergence is extended by including a maximum number of iterations.

    Regarding the strategy to tackle down this problem, I would say randomly assigning some data point to it is not very clever since we might be affecting the clusters quality since the distance to its currently assigned center is large or small. An heuristic for this case would be to choose the farthest point from the biggest cluster and move that the empty cluster, then do so until there are no empty clusters.