I used the code below to create k-means clusters with scikit-learn.
kmean = KMeans(n_clusters=nclusters,n_jobs=-1,random_state=2376,max_iter=1000,n_init=1000,algorithm='full',init='k-means++')
kmean_fit = kmean.fit(clus_data)
I also saved the centroids using kmean_fit.cluster_centers_
I then pickled the KMeans object.
filename = pickle_path+'\\'+'_kmean_fit.sav'
pickle.dump(kmean_fit, open(filename, 'wb'))
My plan is to load the pickled k-means object later and apply it to new data as it arrives, using kmean_fit.predict().
Questions:
Will loading the pickled k-means object and calling kmean_fit.predict() assign each new observation to one of the existing clusters, based on the centroids of those clusters? Or does this approach recluster from scratch on the new data?
If this method won't work, how can I assign new observations to the existing clusters with efficient Python code, given that I have already saved the cluster centroids?
PS: I know that building a classifier with the existing clusters as the dependent variable is another option, but I don't want to do that because of a time crunch.
Yes. The predict method assigns each new observation to the nearest existing centroid; it does not refit the model or recluster anything. Whether the sklearn.cluster.KMeans object was pickled makes no difference: if you unpickle it correctly, you are working with the same fitted object, so predict behaves exactly as it would on the original.
An example:
from sklearn.cluster import KMeans
import joblib
model = KMeans(n_clusters = 2, random_state = 100)
X = [[0,0,1,0], [1,0,0,1], [0,0,0,1],[1,1,1,0],[0,0,0,0]]
model.fit(X)
Out:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
n_jobs=1, precompute_distances='auto', random_state=100, tol=0.0001,
verbose=0)
Continue:
joblib.dump(model, 'model.pkl')
model_loaded = joblib.load('model.pkl')
model_loaded
Out:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
n_jobs=1, precompute_distances='auto', random_state=100, tol=0.0001,
verbose=0)
See how the n_clusters and random_state parameters are the same for the model and model_loaded objects? You're good to go.
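Matching constructor parameters is only a rough check, since they say nothing about the fitted state. A stricter check, sketched below, compares the fitted centroids and the predicted labels before and after a round trip. It uses the standard pickle module in memory rather than a file, and the name restored is just an illustrative choice.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

X = np.array(
    [[0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0]],
    dtype=float,
)
model = KMeans(n_clusters=2, n_init=10, random_state=100).fit(X)

# Serialize and deserialize in memory; a file on disk behaves the same way.
restored = pickle.loads(pickle.dumps(model))

# The fitted state lives in cluster_centers_, so compare that directly.
centers_match = np.array_equal(model.cluster_centers_, restored.cluster_centers_)

# Both objects should also label the same data identically.
labels_match = np.array_equal(model.predict(X), restored.predict(X))

print("centroids identical:", centers_match)
print("predictions identical:", labels_match)
```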
Predict with the "new" model:
model_loaded.predict([[0,0,0,0]])
Out[64]: array([0])
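If you only have the saved centroids and not the pickled object, you can do the same assignment by hand. The following is a minimal sketch, assuming plain Euclidean distance and a hypothetical helper named assign_to_nearest (it is not part of scikit-learn). It also checks that calling predict leaves the centroids untouched, i.e. nothing is reclustered.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array(
    [[0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0]],
    dtype=float,
)
model = KMeans(n_clusters=2, n_init=10, random_state=100).fit(X)

# Stand-in for the centroids you saved earlier from cluster_centers_.
saved_centers = model.cluster_centers_.copy()


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid (hypothetical helper)."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    # Pairwise differences, shape (n_points, n_centroids, n_features).
    diffs = points[:, None, :] - centroids[None, :, :]
    # Squared Euclidean distances, shape (n_points, n_centroids).
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    return sq_dists.argmin(axis=1)


# Made-up new observations, for illustration only.
new_points = np.array([[0.9, 0.1, 0.8, 0.2], [0.13, 0.02, 0.05, 0.71]])

manual_labels = assign_to_nearest(new_points, saved_centers)
predict_labels = model.predict(new_points)

# predict must not modify the fitted centroids.
centers_unchanged = np.array_equal(saved_centers, model.cluster_centers_)

print("manual: ", manual_labels)
print("predict:", predict_labels)
print("centroids unchanged after predict:", centers_unchanged)
```

For large batches the intermediate diffs array can get big, so processing the new data in chunks (or simply using predict on the loaded model) is probably the more practical route.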