python pandas dataframe large-data

Python DataFrames: Cartesian operation on a large amount of data


I have two DataFrames, each with around 30k rows and 8 columns. I need to subtract the values of every row in the first df from the values of every row in the second df (to compute the Euclidean distance between every pair of rows), which will probably result in a 3D structure holding only the differences between every pair of rows. I've tried several approaches, but each one takes a very long time to complete. Is there an efficient way to do this?


Solution

  • For what it's worth, your Cartesian product can be done as follows:

    import pandas as pd
    
    df1 = pd.DataFrame({'A': [1,2,3]})
    df2 = pd.DataFrame({'B': [4,5,6]})
    
    df3 = pd.merge(df1.assign(key=1), df2.assign(key=1), on='key').drop('key', axis=1)
    df3
    #   A  B
    #0  1  4
    #1  1  5
    #2  1  6
    #3  2  4
    #4  2  5
    #5  2  6
    #6  3  4
    #7  3  5
    #8  3  6
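
If the end goal is the Euclidean distance between every pair of rows, it may be cheaper to skip the merged frame and work on the underlying NumPy arrays. With 30k rows on each side and 8 columns, the full 3D array of float64 differences would be about 30,000 × 30,000 × 8 × 8 bytes ≈ 58 GB, and even the 2D distance matrix alone is about 7 GB. Below is a minimal sketch that uses broadcasting on one chunk of rows at a time so the temporary 3D array stays small. The function name `pairwise_euclidean`, the `chunk_size` default, and the sample frames are illustrative assumptions, not something taken from the question.

```python
import numpy as np
import pandas as pd


def pairwise_euclidean(df_a, df_b, chunk_size=256):
    """Return a (len(df_a), len(df_b)) array of Euclidean distances.

    df_a is processed chunk_size rows at a time, so the temporary
    difference array has shape (chunk_size, len(df_b), n_columns)
    rather than (len(df_a), len(df_b), n_columns).
    """
    a = df_a.to_numpy(dtype=np.float64)
    b = df_b.to_numpy(dtype=np.float64)
    out = np.empty((len(a), len(b)), dtype=np.float64)
    for start in range(0, len(a), chunk_size):
        stop = start + chunk_size
        # Broadcasting: (chunk, 1, cols) - (1, n_b, cols) -> (chunk, n_b, cols)
        diff = a[start:stop, None, :] - b[None, :, :]
        out[start:stop] = np.sqrt((diff ** 2).sum(axis=2))
    return out


# Small hypothetical frames, only to show the shapes involved.
df1 = pd.DataFrame({'x': [0.0, 3.0], 'y': [0.0, 4.0]})
df2 = pd.DataFrame({'x': [0.0, 6.0, 3.0], 'y': [0.0, 8.0, 0.0]})

dist = pairwise_euclidean(df1, df2, chunk_size=1)
print(dist)
```

Here `dist[i, j]` is the distance between row `i` of `df1` and row `j` of `df2`. The sketch assumes both frames hold the same numeric columns in the same order. If SciPy is available, `scipy.spatial.distance.cdist(df1.to_numpy(), df2.to_numpy())` computes the same matrix without building the 3D intermediate, although the result for 30k × 30k rows still needs several gigabytes of memory.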