Can I use the cluster_centers_ coordinates from a previous KMeans fit as the init argument, to sequentially update the cluster centers as new data arrives? Are there any drawbacks to this method?
UPDATED: online version using scikit-learn's KMeans:
    KM = KMeans(n_clusters=3, random_state=200, n_init=1)
    until interrupted:
        for x in data:
            KM_updated = KM.fit(x)
            updated_centroids[i] = KM_updated.cluster_centers_[i] + 1 / (len(KM_updated.labels_[i]) + 1) * (x - KM_updated.cluster_centers_[i])
            KM = KMeans(n_clusters=3, random_state=200, init=updated_centroids, n_init=1)
Yes, it is a possible solution. However, you could further improve your implementation by following this pseudo-code (for more info, take a look at this post: Online k-means clustering):
    Make initial guesses for the means m1, m2, ..., mk
    Set the counts n1, n2, ..., nk to zero
    Until interrupted
        Acquire the next example, x
        If mi is closest to x
            Increment ni
            Replace mi by mi + (1/ni) * (x - mi)
        end_if
    end_until
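The pseudo-code above can be sketched in NumPy roughly as follows (the function name, the stream argument, and the initial guesses are assumptions for illustration):

```python
import numpy as np

def online_kmeans(stream, initial_means):
    """Sequential k-means: update only the closest mean for each arriving point."""
    means = np.asarray(initial_means, dtype=float).copy()
    counts = np.zeros(len(means), dtype=int)
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Find the mean m_i closest to x
        i = np.argmin(np.linalg.norm(means - x, axis=1))
        # Increment n_i, then move m_i toward x by 1/n_i of the gap
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
    return means, counts
```

Each update touches only one centroid and one counter, so every data point can be discarded as soon as it has been processed.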
Following this version of the online algorithm you only need to remember the mean of each cluster and the count of the number of data points assigned to the cluster. Once you update those two variables, you can forget the new data point.
Compared to yours, this solution doesn't need to keep any past data around, so it is more memory- and compute-efficient.
This exact implementation is not available in scikit-learn. The closest alternative is probably the MiniBatchKMeans estimator with its partial_fit method.
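As a sketch of that alternative (the synthetic stream, batch size, and noise scale here are made up for illustration), MiniBatchKMeans can be updated incrementally by calling partial_fit on each batch as it arrives:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
true_centers = np.array([[-5.0, 0.0], [0.0, 0.0], [5.0, 0.0]])

mbk = MiniBatchKMeans(n_clusters=3, random_state=200, n_init=1)

# Feed the stream in small batches; each partial_fit call refines the
# current centroids instead of refitting from scratch.
for _ in range(200):
    labels = rng.integers(0, 3, size=32)
    batch = true_centers[labels] + rng.normal(scale=0.3, size=(32, 2))
    mbk.partial_fit(batch)

print(np.sort(mbk.cluster_centers_[:, 0]))  # roughly -5, 0, 5
```

Note that partial_fit keeps its own per-cluster counts internally, much like the ni counters in the pseudo-code, so old batches never need to be retained.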