Tags: python, pandas, dataframe, distance

Compute a distance matrix in a pandas DataFrame


I would like to compute a distance between every pair of elements of two Series:

import pandas as pd
a = pd.Series([1,2,3], ['a', 'b', 'c'] )
b = pd.Series([4, 5, 6, 7], ['k', 'l', 'm', 'n'])

def dist(x, y):
    return x - y  # or some arbitrary function

I got the expected result using NumPy broadcasting, then wrapping the array in a DataFrame to add the index and columns.

import numpy as np
pd.DataFrame(a.values[np.newaxis, :] - b.values[:, np.newaxis],
             columns=a.index,
             index=b.index)

>>>    a  b  c
   k -3 -2 -1
   l -4 -3 -2
   m -5 -4 -3
   n -6 -5 -4

This does not feel as robust as operating directly on pandas objects. Is there a way to achieve this in pandas?


Solution

  • In my opinion, NumPy broadcasting is the faster and better option here, but a pandas-only solution is also possible by looping over b with Series.apply (slower):

    print (b.apply(lambda x: dist(a, x)))
       a  b  c
    k -3 -2 -1
    l -4 -3 -2
    m -5 -4 -3
    n -6 -5 -4
    
    print (b.apply(lambda x: a - x))
       a  b  c
    k -3 -2 -1
    l -4 -3 -2
    m -5 -4 -3
    n -6 -5 -4
    

    # your solution (slightly simplified)
    df = pd.DataFrame(a.to_numpy() - b.to_numpy()[:, None],
                      columns=a.index,
                      index=b.index)
    print (df)
       a  b  c
    k -3 -2 -1
    l -4 -3 -2
    m -5 -4 -3
    n -6 -5 -4
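
The broadcasting version can also be wrapped so that it takes the arbitrary dist from the question. This is only a minimal sketch: pairwise is a hypothetical helper name (not a pandas or NumPy API), and it assumes dist is built from element-wise operations that accept NumPy arrays. A dist that only works on scalars would need np.vectorize or the Series.apply loop instead.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
b = pd.Series([4, 5, 6, 7], index=['k', 'l', 'm', 'n'])

def dist(x, y):
    return x - y  # assumed to be element-wise, so it broadcasts

def pairwise(func, cols, rows):
    """Hypothetical helper: evaluate func on every (cols, rows) pair."""
    # A (1, len(cols)) array against a (len(rows), 1) array
    # broadcasts to a (len(rows), len(cols)) result.
    values = func(cols.to_numpy()[None, :], rows.to_numpy()[:, None])
    return pd.DataFrame(values, index=rows.index, columns=cols.index)

df_fast = pairwise(dist, a, b)

# Cross-check against the pandas-only loop from the answer.
df_apply = b.apply(lambda x: dist(a, x))
same = np.array_equal(df_fast.to_numpy(), df_apply.to_numpy())
print(df_fast)
print(same)
```

The labels come only from the two index objects passed to the DataFrame constructor, so the result has b's labels as rows and a's labels as columns, matching the Series.apply output.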