python-2.7 · machine-learning · scikit-learn · k-means

K-means clustering time in scikit-learn


How much time should it take to cluster a set of 100'000 L2-normalized 2048-dimensional feature vectors using k-means with 200 clusters? I have all my data in one huge numpy array; maybe there's a more appropriate data structure?

It didn't seem to make any progress in an hour. I'm also inclined to use the threshold stopping criterion, but it seems to take more than 5 minutes for just 2 iterations. Is there some sort of verbose option I can use to check on the progress of k-means clustering in scikit-learn? Does anyone suggest another approach? Maybe something like dimensionality reduction, or PCA followed by k-means? (I'm just throwing random ideas out there.)


Solution

  • If you haven't tried it yet, use sklearn.cluster.MiniBatchKMeans instead of sklearn.cluster.KMeans

    E.g., if X.shape = (100000, 2048), then write

    from sklearn.cluster import MiniBatchKMeans
    mbkm = MiniBatchKMeans(n_clusters=200)  # Take a good look at the docstring and set options here
    mbkm.fit(X)
    

    MiniBatchKMeans finds slightly different clusters from standard KMeans, but it has the huge advantage of being an online algorithm that doesn't need all the data at every iteration and still gives useful results.
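
    Since the question also asks about progress output and about PCA before k-means, here is a hedged sketch tying those together: both KMeans and MiniBatchKMeans accept a verbose parameter, and MiniBatchKMeans additionally exposes partial_fit, which lets you feed the data in chunks rather than all at once. The sizes below are scaled down so the snippet runs quickly; substitute your real (100000, 2048) array and n_clusters=200.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.RandomState(0)
    X = rng.rand(2000, 128)  # stand-in for the real (100000, 2048) feature matrix

    # Optional dimensionality reduction first, as the question suggests.
    X_red = PCA(n_components=64).fit_transform(X)

    # verbose=1 would print per-batch progress; omitted here to keep output quiet.
    mbkm = MiniBatchKMeans(n_clusters=20, random_state=0)

    # Online usage: the full array is never required in a single fit call.
    for chunk in np.array_split(X_red, 10):
        mbkm.partial_fit(chunk)

    labels = mbkm.predict(X_red)
    ```

    Whether PCA actually helps depends on how much variance the leading components capture for your features; for L2-normalized deep-network descriptors it often speeds things up considerably at little cost in cluster quality, but that is worth verifying on your own data.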