I'm trying to run a k-means clustering algorithm with numpy and Python, but keep running into a memory error for larger values of K (anything greater than 10 seems to trigger it). I have two numpy arrays of shape [42000, 784] (the dataset) and [K, 784] (the centroids). The memory error occurs when calculating the Euclidean distance between each of the centroids and each of the data points. This is the function I have been using:
import numpy as np

def dist(a, b):
    # a: data, shape (42000, 784); b: centroids, shape (K, 784)
    a = a[np.newaxis, :, :]  # (1, 42000, 784)
    b = b[:, np.newaxis, :]  # (K, 1, 784)
    # broadcasting a - b creates a (K, 42000, 784) array before the norm
    dist = np.linalg.norm(a - b, axis=2)
    return dist
Is this a memory leak or do I legitimately not have enough memory (I have 8GB)? How can I fix this?
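This is not a memory leak; the broadcasting itself is the problem. a - b materializes the full (K, 42000, 784) difference array in float64 before np.linalg.norm ever runs, and a quick back-of-the-envelope check shows how fast that outgrows 8 GB:

# bytes needed for the (K, 42000, 784) float64 intermediate created by a - b
for K in (10, 256):
    print(K, K * 42000 * 784 * 8 / 1e9, "GB")  # ~2.6 GB at K=10, ~67 GB at K=256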
scipy has built-in functions for distance computations, which are lightning fast compared to home-made implementations. So, the first idea is to replace your whole distance function with the following expression:
from numpy.random import rand
from scipy.spatial import distance

# sample data with the shapes from the question
a = rand(42000, 784)
b = rand(256, 784)

# distance computation
dist = distance.cdist(a, b, metric='euclidean')  # about 8.02 s on my 8 GB RAM machine
Note that dist in this example is transposed relative to the output of your function. If you want the shape from your example, just do dist = distance.cdist(a, b).T.
It is further possible to speed up the computation a little by omitting the square root operation. You may accomplish this with dist = distance.cdist(a, b, metric='sqeuclidean').
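The squared metric is safe whenever you only need the nearest centroid, since the square root is monotonic and does not change which distance is smallest. A small sketch of that assignment step (shapes as in the question):

from numpy.random import rand
from scipy.spatial import distance

a = rand(42000, 784)  # data points
b = rand(256, 784)    # centroids
d2 = distance.cdist(a, b, metric='sqeuclidean')  # (42000, 256), no sqrt taken
labels = d2.argmin(axis=1)  # same assignment as with metric='euclidean'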
This approach also keeps memory consumption in check: cdist never materializes the huge (K, 42000, 784) broadcast intermediate, it only allocates the (42000, K) result, and even that memory is needed only for the few seconds the computation takes.
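If you want to keep a home-made implementation, the same cdist call drops straight into a full k-means loop. The sketch below is my own illustration, not your original code: the random-point initialization, the iteration cap, and the allclose stopping test are all assumptions on my part.

import numpy as np
from scipy.spatial import distance

def kmeans(data, k, n_iter=100, seed=0):
    # pick k distinct data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # (n, k) squared distances; no (k, n, 784) intermediate is created
        labels = distance.cdist(data, centroids, metric='sqeuclidean').argmin(axis=1)
        # move each centroid to the mean of its cluster; leave empty clusters in place
        new_centroids = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

data = np.random.rand(42000, 784)
centroids, labels = kmeans(data, k=256)  # each iteration costs roughly one cdist call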
The second idea is to not use home-made implementations at all, but a reliable third-party package, like the well-known scikit-learn:
from numpy.random import rand
from sklearn.cluster import KMeans

a = rand(4200, 200)
km = KMeans(n_clusters=256)
km.fit(a)  # about 10 s
One of several advantages of this implementation is that it automatically decides how to compute the distances, so that it does not blow up your memory.
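Once fitted, the quantities your dist function was feeding into are available as standard attributes of the estimator (labels_ and cluster_centers_ are part of scikit-learn's documented KMeans API); with the shapes from your question:

from numpy.random import rand
from sklearn.cluster import KMeans

a = rand(42000, 784)
km = KMeans(n_clusters=256).fit(a)

labels = km.labels_              # cluster index for each of the 42000 points
centroids = km.cluster_centers_  # array of shape (256, 784)

If even this is too heavy, sklearn.cluster.MiniBatchKMeans exposes the same interface while processing the data in small batches.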