
How to calculate pairwise Euclidean distance between a collection of vectors


I have a pandas DataFrame like the one below, where the index is a pd.DatetimeIndex and each column is a time series.

                 x_1        x_2        x_3
2020-08-17   133.230  2457.4500  -4676.000
2020-08-18  -982.000 -6354.5600   -245.657
2020-08-19  5678.642   245.2786   2461.785
2020-08-20 -2394.000   154.3400   -735.653
2020-08-20   236.000 -8876.0000   -698.245

I need to calculate the Euclidean distance between every pair of columns, i.e. for the pairs (x_1, x_2), (x_1, x_3) and (x_2, x_3), and return a square DataFrame like this (the values in this table are placeholders, not actual Euclidean distances):

     x_1  x_2  x_3
x_1    0  123  456
x_2  123    0  789
x_3  456  789    0

I tried this resource, but I could not figure out how to pass the columns of my df. If I understand correctly, the example there treats the rows as the series between which the distance is calculated.


Solution

  • An explicit way of achieving this would be:

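    from itertools import combinations

    import numpy as np
    import pandas as pd

    # Square frame labelled by column name; every cell starts as NaN
    dist_df = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)

    # Visit each unordered pair of columns once and fill both symmetric cells
    for col_a, col_b in combinations(df.columns, 2):
        dist = np.linalg.norm(df[col_a] - df[col_b])
        dist_df.loc[col_a, col_b] = dist
        dist_df.loc[col_b, col_a] = dist

    print(dist_df)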
    

    Output:

                  x_1           x_2           x_3
    x_1           NaN  12381.858429   6135.306973
    x_2  12381.858429           NaN  12680.121047
    x_3   6135.306973  12680.121047           NaN
    

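    The diagonal stays NaN because combinations never pairs a column with itself. The distance from a column to itself is 0, so if you want 0 instead of NaN, use DataFrame.fillna: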

    dist_df.fillna(0, inplace=True)
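
A loop-free variant is also possible. The sketch below is not part of the original answer: it assumes the sample data from the question (rebuilt here so the snippet runs on its own) and swaps the explicit loop for NumPy broadcasting. Transposing the values turns each column into a row vector, which also covers the case where an example expects the series as rows.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (the repeated 2020-08-20 label is kept as posted)
df = pd.DataFrame(
    {
        "x_1": [133.23, -982, 5678.642, -2394, 236],
        "x_2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x_3": [-4676, -245.657, 2461.785, -735.653, -698.245],
    },
    index=pd.to_datetime(
        ["2020-08-17", "2020-08-18", "2020-08-19", "2020-08-20", "2020-08-20"]
    ),
)

# Transpose so that each row of `vectors` holds one column of df
vectors = df.to_numpy(dtype=float).T

# Broadcasting builds every pairwise difference: shape (n_cols, n_cols, n_rows)
diff = vectors[:, None, :] - vectors[None, :, :]

# The Euclidean norm along the last axis gives the square distance matrix
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

dist_df = pd.DataFrame(dist_matrix, index=df.columns, columns=df.columns)
print(dist_df)
```

Here the diagonal comes out as 0 directly, so no fillna step is needed. The intermediate array grows with the square of the number of columns, so for a very wide frame the explicit loop, or scipy.spatial.distance.pdist applied to df.T together with squareform, may be the more memory-friendly choice.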