I have built a k-means clustering model using scikit-learn. I need to re-train the existing model daily, using new data.
I looked for a technique that can be used to re-train the existing model, but I couldn't find any straightforward method for this.
As I am getting large datasets daily, I cannot re-train the model from the beginning since it is not practical for the long run. Therefore, I need a method that can be used to re-train the existing model using new data.
You want to look at online learning techniques for this. Many scikit-learn estimators provide a partial_fit method, which lets you train incrementally on small batches of data. In your case, you can use sklearn.cluster.MiniBatchKMeans, which updates the cluster centers from a fraction of the samples (a mini-batch) on each iteration, making it a natural candidate for online learning problems. However, you must still update the model through partial_fit; calling fit again discards the existing centers and retrains the model from scratch on whatever data you pass.
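As a rough sketch of what a daily update could look like: the snippet below loads a previously saved model if one exists, calls partial_fit on the new day's data, and saves it again. The make_daily_batch helper, the synthetic blob data, the file name, and n_clusters=3 are all assumptions made up for illustration; swap in your own data loading and storage.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)


def make_daily_batch(n_samples=300):
    """Hypothetical stand-in for one day's data: three noisy 2-D blobs."""
    centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
    blob = rng.integers(0, len(centers), size=n_samples)
    return centers[blob] + rng.normal(scale=0.5, size=(n_samples, 2))


# Assumed storage location; a temp directory keeps the sketch self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "kmeans.joblib")

for day in range(5):
    # Reuse yesterday's model if it exists, otherwise start a new one.
    if os.path.exists(model_path):
        model = joblib.load(model_path)
    else:
        model = MiniBatchKMeans(n_clusters=3, random_state=0)

    # Update the existing centers with today's data only.
    # Calling model.fit(...) here would re-initialize the centers instead.
    model.partial_fit(make_daily_batch())

    # Persist the updated model for the next run.
    joblib.dump(model, model_path)

# The updated model can be used for prediction as usual.
predicted = model.predict(make_daily_batch(10))
```

Note that the first partial_fit call initializes the centers from that first batch, so it needs at least n_clusters samples, and the number of clusters stays fixed across updates.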