python pandas similarity euclidean-distance

Extract distances after running scipy.spatial.distance.pdist


I have a Pandas data frame (see small example below). I want to calculate Euclidean distances between observations (rows) based on their values in 3 columns (features). I am using scipy.spatial.distance.pdist.

I understand that the returned object (dist) contains 190 distances between my 20 observations (rows). I assume it's an "unfurled" triangular matrix: the distance between the 1st row and the 2nd, then between the 1st and the 3rd, ..., between the 1st and the 20th, then between the 2nd and the 3rd, the 2nd and the 4th, and so on.

However, I am not sure. Also, how could I build a symmetric 20-by-20 matrix with these distances in it?

My ultimate objective: For each observation (row) I want to find its closest 5 neighbors (i.e., rows with the smallest distance from it) and sum up those 5 distances. If I had a square matrix, I could just apply a function to each column. But right now I am not sure how to deal with 'dist'.

Thanks a lot for your help!

import numpy as np
import pandas as pd
import scipy.spatial.distance  # import the submodule explicitly; a bare `import scipy` may not expose it

# Generate a fake Pandas data frame: 20 observations (rows), 3 features (columns)
df = pd.DataFrame({
    'a': np.random.normal(1, 0.1, 20),
    'b': np.random.normal(2, 0.1, 20),
    'c': np.random.normal(3, 0.1, 20),
})

# Condensed vector of pairwise Euclidean distances between rows
dist = scipy.spatial.distance.pdist(df, metric='euclidean')

dist.shape # (190,)


Solution

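  • You can pass dist to scipy.spatial.distance.squareform. It converts the condensed vector of n(n-1)/2 pairwise distances (190 for n = 20) into a symmetric n-by-n matrix with zeros on the diagonal. Your guess about the ordering is right: pdist stores the distances from row 0 to rows 1, 2, ..., n-1, then from row 1 to rows 2, ..., n-1, and so on.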

    d_matrix = scipy.spatial.distance.squareform(dist)
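
For the stated goal (the sum of the 5 smallest distances for each row), something along these lines should work once you have the square matrix. This is only a minimal sketch: the names X, k, d and knn_sum are illustrative and not from the question, and it uses a plain NumPy array as a stand-in for the data frame. The zero self-distance on the diagonal is masked out so that a row is never counted as its own neighbor.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stand-in for the question's data: 20 observations, 3 features (assumed values)
rng = np.random.default_rng(0)
X = rng.normal(loc=[1, 2, 3], scale=0.1, size=(20, 3))

dist = pdist(X, metric='euclidean')   # condensed vector, length 20 * 19 / 2 = 190
d_matrix = squareform(dist)           # symmetric 20 x 20 matrix, zeros on the diagonal

# Ordering check: the first n - 1 condensed entries are row 0 versus rows 1..n-1
n = X.shape[0]
assert np.allclose(dist[:n - 1], d_matrix[0, 1:])

k = 5
d = d_matrix.copy()
np.fill_diagonal(d, np.inf)           # exclude each row's distance to itself

# np.partition places the k smallest values of each row in the first k columns
# (in no particular order), which is enough for summing them
nearest = np.partition(d, k - 1, axis=1)[:, :k]
knn_sum = nearest.sum(axis=1)         # one value per observation
```

If you also need to know which rows the neighbors are, np.argsort(d, axis=1)[:, :k] gives their positions, at the cost of a full sort per row.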