I'm trying to run a k-means clustering algorithm with numpy and Python, but keep running into a memory error for larger values of K (anything greater than 10 seems to trigger it). I have two numpy arrays of shape [42000, 784] (the dataset) and [K, 784] (the centroids). The memory error occurs when calculating the Euclidean distance between each of the centroids and each of the data points. This is the function I have been using:
import numpy as np

def dist(a, b):
    # a: data, shape (42000, 784); b: centroids, shape (K, 784)
    a = a[np.newaxis, :, :]  # (1, 42000, 784)
    b = b[:, np.newaxis, :]  # (K, 1, 784)
    # broadcasting a - b creates a (K, 42000, 784) array before the norm
    dist = np.linalg.norm(a - b, axis=2)
    return dist
Is this a memory leak or do I legitimately not have enough memory (I have 8GB)? How can I fix this?
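This is not a memory leak; the broadcasting itself is the problem. a - b materializes the full (K, 42000, 784) difference array in float64 before np.linalg.norm ever runs, and a quick back-of-the-envelope check shows how fast that outgrows 8 GB:

# bytes needed for the (K, 42000, 784) float64 intermediate created by a - b
for K in (10, 256):
    print(K, K * 42000 * 784 * 8 / 1e9, "GB")  # ~2.6 GB at K=10, ~67 GB at K=256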
scipy has built-in functions for distance computations, which are lightning fast compared to home-made implementations. So, the first idea is to replace your whole distance function with the following expression:
from numpy.random import rand
from scipy.spatial import distance

# sample data with the shapes from the question
a = rand(42000, 784)
b = rand(256, 784)

# distance computation
dist = distance.cdist(a, b, metric='euclidean')  # about 8.02 s on my 8 GB RAM machine
Note that dist in this example is transposed relative to the output of your function. If you want the shape from your example, just do dist = distance.cdist(a, b).T.
It is further possible to speed up the computation a little by omitting the square root operation. You may accomplish this with dist = distance.cdist(a, b, metric='sqeuclidean').
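The squared metric is safe whenever you only need the nearest centroid, since the square root is monotonic and does not change which distance is smallest. A small sketch of that assignment step (shapes as in the question):

from numpy.random import rand
from scipy.spatial import distance

a = rand(42000, 784)  # data points
b = rand(256, 784)    # centroids
d2 = distance.cdist(a, b, metric='sqeuclidean')  # (42000, 256), no sqrt taken
labels = d2.argmin(axis=1)  # same assignment as with metric='euclidean'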
This approach also keeps memory consumption in check: cdist never materializes the huge (K, 42000, 784) broadcast intermediate, it only allocates the (42000, K) result, and even that memory is needed only for the few seconds the computation takes.
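If you want to keep a home-made implementation, the same cdist call drops straight into a full k-means loop. The sketch below is my own illustration, not your original code: the random-point initialization, the iteration cap, and the allclose stopping test are all assumptions on my part.

import numpy as np
from scipy.spatial import distance

def kmeans(data, k, n_iter=100, seed=0):
    # pick k distinct data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # (n, k) squared distances; no (k, n, 784) intermediate is created
        labels = distance.cdist(data, centroids, metric='sqeuclidean').argmin(axis=1)
        # move each centroid to the mean of its cluster; leave empty clusters in place
        new_centroids = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

data = np.random.rand(42000, 784)
centroids, labels = kmeans(data, k=256)  # each iteration costs roughly one cdist call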
The second idea is to not use home-made implementations at all, but a reliable third-party package, like the well-known scikit-learn:
from numpy.random import rand
from sklearn.cluster import KMeans

a = rand(4200, 200)
km = KMeans(n_clusters=256)
km.fit(a)  # about 10 s
One of several advantages of this implementation is that it automatically decides how to compute the distances, so that it does not blow up your memory.
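Once fitted, the quantities your dist function was feeding into are available as standard attributes of the estimator (labels_ and cluster_centers_ are part of scikit-learn's documented KMeans API); with the shapes from your question:

from numpy.random import rand
from sklearn.cluster import KMeans

a = rand(42000, 784)
km = KMeans(n_clusters=256).fit(a)

labels = km.labels_              # cluster index for each of the 42000 points
centroids = km.cluster_centers_  # array of shape (256, 784)

If even this is too heavy, sklearn.cluster.MiniBatchKMeans exposes the same interface while processing the data in small batches.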