Euclidean distance and indicator from a large dataframe


I have a large DataFrame of shape (189090, 8), and I need to calculate the Euclidean distance and the similarity between its rows.

My approach:

from scipy.spatial import KDTree
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled = scaler.fit_transform(ds)

Y = pdist(scaled)

Y_squared = squareform(Y)

X_tree = KDTree(Y_squared)

dist, ind = X_tree.query(Y_squared, k=4)

But when I run the code, my notebook kernel shuts down, or PyCharm kills the process. If I reduce the DataFrame to a smaller shape (e.g. (5000, 8)), the process runs normally.

I tried reducing the memory used by the DataFrame, but it still did not work. I know that the line that fails is Y = pdist(scaled).

How can I make this work?


Solution

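  • According to the documentation, pdist "returns a condensed distance matrix", that is, a 1-D array holding every pairwise distance. For your data it would try to calculate and return about 189090^2/2 = 17877514050 entries, roughly 143 GB as 64-bit floats, causing your computer to run out of RAM. Calling squareform on the result would double that again.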

    If you only need the distances between some specific data points, select those points first and pass only them to pdist.

    If you really need the entire distance matrix, it's better to calculate the distances for a small partition of data points at a time (e.g. 1000) and save each result to disk.
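The partitioned approach could be sketched roughly as follows. This is a minimal illustration, not a tested solution for the full dataset: the function name `nearest_neighbours_chunked` and its parameters are hypothetical, and it assumes the goal is the k nearest rows for every row (as in the question's `query(..., k=4)` call). It uses `scipy.spatial.distance.cdist` to compute distances from one chunk of rows to all rows, so only one block of the matrix is in memory at a time.

```python
import numpy as np
from scipy.spatial.distance import cdist


def nearest_neighbours_chunked(points, k=4, chunk_size=1000):
    """Return (dist, ind) of the k nearest rows for every row of `points`.

    Hypothetical helper: only a (chunk_size, n) block of the distance
    matrix exists at any moment, never the full n x n matrix.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    dist = np.empty((n, k))
    ind = np.empty((n, k), dtype=np.int64)

    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Euclidean distances from this chunk to every row: shape (stop - start, n).
        block = cdist(points[start:stop], points)
        # Positions of the k smallest distances in each row (in no particular order).
        nearest = np.argpartition(block, k - 1, axis=1)[:, :k]
        nearest_dist = np.take_along_axis(block, nearest, axis=1)
        # Sort those k candidates so column 0 is the closest neighbour.
        order = np.argsort(nearest_dist, axis=1)
        ind[start:stop] = np.take_along_axis(nearest, order, axis=1)
        dist[start:stop] = np.take_along_axis(nearest_dist, order, axis=1)

    return dist, ind


# Small synthetic stand-in for the scaled data (random values, not the real dataset).
rng = np.random.default_rng(0)
scaled_demo = rng.random((57, 8))
dist, ind = nearest_neighbours_chunked(scaled_demo, k=4, chunk_size=10)
```

As in the question's KDTree query, each row's closest neighbour is the row itself at distance 0. With 189090 rows and `chunk_size=1000`, each block holds about 1.9 × 10^8 float64 values, roughly 1.5 GB, so a smaller chunk size may be needed on machines with little RAM. If the full distances are required rather than only the nearest neighbours, each `block` could instead be written to disk inside the loop, for example with `np.save`.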