I would like to compute the Cartesian product (i.e. a pandas merge with how='cross') of each row with all other rows, but within a groupby. I have found several ways to do it, but they take more than a day to run on my dataset, so I'm trying to find a more efficient approach.
For instance, I can loop through my DataFrame, save the subsetted DF, do the merge, and then concatenate:
new_df = pd.DataFrame()
for o in set(df['owner']):
    tmp = df[df['owner'] == o].reset_index(drop=True)
    tmp = tmp.merge(tmp, how='cross')
    new_df = pd.concat([tmp, new_df])
This works, but it literally takes a few days of non-stop running on my fairly powerful computer.
I apologize if this question is poorly worded or unclear, I'm used to finding solutions but not asking questions myself. Feel free to request clarification.
I would use itertools instead. Here is an example, together with the time it takes to run:
import pandas as pd
from itertools import product
import time
# Generate large random dataset
n_rows = 100000
n_groups = 1000
df = pd.DataFrame({
    'owner': [f'group{i}' for i in range(n_groups) for _ in range(n_rows//n_groups)],
    'value': range(n_rows)
})
def cartesian_product(df):
    # All ordered pairs of row indices within the group
    index_combinations = list(product(df.index, repeat=2))
    result = pd.DataFrame(index_combinations, columns=['index1', 'index2'])
    result = result.merge(df, left_on='index1', right_index=True)
    result = result.merge(df, left_on='index2', right_index=True, suffixes=('_1', '_2'))
    # Drop the self-pairs (each row matched with itself)
    result = result[result['index1'] != result['index2']]
    return result.reset_index(drop=True)
start_time = time.time()
result = df.groupby('owner').apply(cartesian_product)
elapsed_time = time.time() - start_time
print(result)
print(f"Elapsed time: {elapsed_time:.2f} seconds")
which returns
index1 index2 owner_1 value_1 owner_2 value_2
owner
group0 0 1 0 group0 1 group0 0
1 2 0 group0 2 group0 0
2 3 0 group0 3 group0 0
3 4 0 group0 4 group0 0
4 5 0 group0 5 group0 0
... ... ... ... ... ... ...
group999 9895 99994 99999 group999 99994 group999 99999
9896 99995 99999 group999 99995 group999 99999
9897 99996 99999 group999 99996 group999 99999
9898 99997 99999 group999 99997 group999 99999
9899 99998 99999 group999 99998 group999 99999
[9900000 rows x 6 columns]
Elapsed time: 13.59 seconds
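Note that groupby().apply() returns the result indexed by the group key; if you prefer a flat DataFrame like the one your loop produces, you can simply drop that index afterwards, for example:
flat = result.reset_index(drop=True)  # drop the (owner, row-number) MultiIndex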
The same thing timed with your original method (note that merge(how='cross') also keeps each row paired with itself, which cartesian_product filters out, hence 10,000,000 rows instead of 9,900,000):
start_time = time.time()
new_df = pd.DataFrame()
for o in set(df['owner']):
    tmp = df[df['owner'] == o].reset_index(drop=True)
    tmp = tmp.merge(tmp, how='cross')
    new_df = pd.concat([tmp, new_df])
elapsed_time = time.time() - start_time
print(new_df)
print(f"Elapsed time: {elapsed_time:.2f} seconds")
returns
owner_x value_x owner_y value_y
0 group628 62800 group628 62800
1 group628 62800 group628 62801
2 group628 62800 group628 62802
3 group628 62800 group628 62803
4 group628 62800 group628 62804
... ... ... ... ...
9995 group146 14699 group146 14695
9996 group146 14699 group146 14696
9997 group146 14699 group146 14697
9998 group146 14699 group146 14698
9999 group146 14699 group146 14699
[10000000 rows x 4 columns]
Elapsed time: 97.56 seconds
So, on this example the itertools-based approach is roughly seven times faster; apply it to your particular need.
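As noted above, merge(how='cross') also keeps each row paired with itself, which cartesian_product filters out. If you need those self-pairs as well, a minimal variant is the same function without the filtering line (the name cartesian_product_with_self is just an illustrative label):
def cartesian_product_with_self(df):
    # Same as cartesian_product, but keeps each row paired with itself
    index_combinations = list(product(df.index, repeat=2))
    result = pd.DataFrame(index_combinations, columns=['index1', 'index2'])
    result = result.merge(df, left_on='index1', right_index=True)
    result = result.merge(df, left_on='index2', right_index=True, suffixes=('_1', '_2'))
    return result.reset_index(drop=True)

result = df.groupby('owner').apply(cartesian_product_with_self)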