Tags: python, pandas, group-by, cartesian-product

Efficient Cartesian product within groupby


I would like to compute the Cartesian product (i.e. a pandas merge with how='cross') of each row with all other rows, but within each group of a groupby. I have found several ways to do it, but they take more than a day to run on my dataset, so I'm trying to find a more efficient approach.

For instance, I can loop through my DataFrame, save the subsetted DF, do the merge, and then concatenate:

new_df = pd.DataFrame()
for o in set(df['owner']):
    # Subset to one owner, cross-join the subset with itself,
    # then append to the running result
    tmp = df[df['owner'] == o].reset_index(drop=True)
    tmp = tmp.merge(tmp, how='cross')
    new_df = pd.concat([tmp, new_df])

This will work, but literally takes a few days of running non-stop on my fairly powerful computer.

I apologize if this question is poorly worded or unclear, I'm used to finding solutions but not asking questions myself. Feel free to request clarification.


Solution

  • I would use itertools instead. Here is an example, together with the time it takes to run:

    import pandas as pd
    from itertools import product
    import time
    
    # Generate large random dataset
    n_rows = 100000
    n_groups = 1000
    df = pd.DataFrame({
        'owner': [f'group{i}' for i in range(n_groups) for _ in range(n_rows//n_groups)],
        'value': range(n_rows)
    })
    
    def cartesian_product(df):
        # All ordered pairs of row indices within the group
        index_combinations = list(product(df.index, repeat=2))
        result = pd.DataFrame(index_combinations, columns=['index1', 'index2'])
        # Attach the row values for each side of the pair
        result = result.merge(df, left_on='index1', right_index=True)
        result = result.merge(df, left_on='index2', right_index=True, suffixes=('_1', '_2'))
        # Drop pairs of a row with itself
        result = result[result['index1'] != result['index2']]
        return result.reset_index(drop=True)
    
    start_time = time.time()
    result = df.groupby('owner').apply(cartesian_product)
    elapsed_time = time.time() - start_time
    
    print(result)
    print(f"Elapsed time: {elapsed_time:.2f} seconds")
    
    

    which returns

                   index1  index2   owner_1  value_1   owner_2  value_2
    owner                                                              
    group0   0          1       0    group0        1    group0        0
             1          2       0    group0        2    group0        0
             2          3       0    group0        3    group0        0
             3          4       0    group0        4    group0        0
             4          5       0    group0        5    group0        0
    ...               ...     ...       ...      ...       ...      ...
    group999 9895   99994   99999  group999    99994  group999    99999
             9896   99995   99999  group999    99995  group999    99999
             9897   99996   99999  group999    99996  group999    99999
             9898   99997   99999  group999    99997  group999    99999
             9899   99998   99999  group999    99998  group999    99999
    
    [9900000 rows x 6 columns]
    Elapsed time: 13.59 seconds
    

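    Note that cartesian_product excludes pairs of a row with itself (the index1 != index2 filter), which is why there are 9,900,000 rows here rather than the 10,000,000 produced by the cross merge below. If you want the full cross product including self-pairs, a minimal sketch (a hypothetical full_cartesian_product variant that simply drops that filter) would be:

    def full_cartesian_product(df):
        # Hypothetical variant of cartesian_product that keeps rows
        # paired with themselves, matching merge(how='cross')
        index_combinations = list(product(df.index, repeat=2))
        result = pd.DataFrame(index_combinations, columns=['index1', 'index2'])
        result = result.merge(df, left_on='index1', right_index=True)
        result = result.merge(df, left_on='index2', right_index=True, suffixes=('_1', '_2'))
        return result.reset_index(drop=True)
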
    For comparison, the same thing with your method

    
    start_time = time.time()
    new_df = pd.DataFrame()
    for o in set(df['owner']):
        tmp = df[df['owner']==o].reset_index(drop=True)
        tmp = tmp.merge(tmp, how='cross')
        new_df = pd.concat([tmp,new_df])
        
    elapsed_time = time.time() - start_time
    
    print(new_df)
    print(f"Elapsed time: {elapsed_time:.2f} seconds")
    

    returns

           owner_x  value_x   owner_y  value_y
    0     group628    62800  group628    62800
    1     group628    62800  group628    62801
    2     group628    62800  group628    62802
    3     group628    62800  group628    62803
    4     group628    62800  group628    62804
    ...        ...      ...       ...      ...
    9995  group146    14699  group146    14695
    9996  group146    14699  group146    14696
    9997  group146    14699  group146    14697
    9998  group146    14699  group146    14698
    9999  group146    14699  group146    14699
    
    [10000000 rows x 4 columns]
    Elapsed time: 97.56 seconds
    

    So, adapting the itertools-based approach to your particular need should give a substantial speed-up; a minimal sketch of how to apply it is below.
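
    For instance, assuming your real DataFrame also has an 'owner' column to group on (as in your example), the groupby().apply() result comes back with a MultiIndex that you may want to flatten (the variable name pairs is just for illustration):

    pairs = df.groupby('owner').apply(cartesian_product)
    # The apply() result is indexed by (owner, row number);
    # drop that MultiIndex if you only need a flat table of pairs
    pairs = pairs.reset_index(drop=True)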