python machine-learning scikit-learn k-means pre-trained-model

How do I re-train an existing K-Means clustering model?

I have built a k-means clustering model using scikit-learn. I need to re-train the existing model daily, using new data.

I looked for a technique to re-train the existing model, but I couldn't find a straightforward method for this.

Because I receive large datasets daily, re-training the model from scratch each time is not practical in the long run. I therefore need a way to update the existing model using only the new data.

Solution

Have a look at online learning techniques. Several scikit-learn estimators implement a partial_fit method, which lets you train incrementally on small batches of data.

In your case, you can use sklearn.cluster.MiniBatchKMeans, which updates the cluster centers from a small batch of samples on each iteration, making it a natural candidate for online learning problems. Note that the new data must be fed through partial_fit: calling fit again discards the current centers and trains a new model from scratch on whatever data you pass it.
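
As a rough illustration of the daily update, here is a minimal sketch. The make_daily_batch helper and its synthetic 2-D data are hypothetical stand-ins for your real daily dataset, and storing the model with pickle between runs is just one assumed way of keeping it around (in practice you would write it to a file rather than keep the bytes in memory).

```python
import pickle

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)


def make_daily_batch(n_samples=500):
    # Hypothetical stand-in for one day's data: three noisy blobs in 2-D.
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
    idx = rng.integers(0, len(centers), size=n_samples)
    return centers[idx] + rng.normal(scale=0.5, size=(n_samples, 2))


# Day 1: create the model and train it on the first batch.
# The first partial_fit call initializes the cluster centers.
model = MiniBatchKMeans(n_clusters=3, random_state=0)
model.partial_fit(make_daily_batch())

# Persist the trained model so the next daily run can pick it up.
saved = pickle.dumps(model)

# Day 2: restore the model and update it with the new data only.
model = pickle.loads(saved)
new_batch = make_daily_batch()
model.partial_fit(new_batch)  # updates the existing centers, no re-initialization

# The updated model can be used for prediction as usual.
labels = model.predict(new_batch)
print(model.cluster_centers_)
```

If a single day's data is too large to hold in memory, the same partial_fit call can be repeated over smaller chunks of that day's data.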