Search code examples
pythonnumpyscipyscipy-spatial

Memory efficient mean pairwise distance


I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.

>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626

In the example above y gets quite large (nearly 2,500 times as large as the input array):

>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793

But since I am only interested in the mean pairwise distance, the distance matrix doesn't have to be kept in memory. Instead the mean of each row (or column) can be computed seperatly. The final mean value can then be computed from the row mean values.

Is there already a function which exploit this property or is there an easy way to extend/combine existing functions to do so?


Solution

  • If you use the square version of distance, it is equivalent to using the variance with n-1:

    from scipy.spatial.distance import pdist, squareform
    import numpy as np
    x = np.random.rand(10000, 2)
    y = np.array([[1,1], [0,0], [2,0]])
    print(pdist(x, 'sqeuclidean').mean())
    print(np.var(x, 0, ddof=1).sum()*2)
    >>0.331474285845873
    0.33147428584587346