I would like to compute the Cartesian product (i.e. a pandas merge with how='cross') of each row with all other rows, but within a groupby. I have found several ways to do it, but they take more than a day to run on my dataset, so I'm trying to find a more efficient approach.
For instance, I can loop through my DataFrame, save the subsetted DF, do the merge, and then concatenate:
new_df = pd.DataFrame()
for o in set(df['owner']):
    tmp = df[df['owner'] == o].reset_index(drop=True)
    tmp = tmp.merge(tmp, how='cross')
    new_df = pd.concat([tmp, new_df])
This works, but it literally takes a few days of non-stop running on my fairly powerful computer.
I apologize if this question is poorly worded or unclear, I'm used to finding solutions but not asking questions myself. Feel free to request clarification.
I would use itertools instead. Here is an example, together with the time it takes to run:
import pandas as pd
from itertools import product
import time
# Generate large random dataset
n_rows = 100000
n_groups = 1000
df = pd.DataFrame({
    'owner': [f'group{i}' for i in range(n_groups) for _ in range(n_rows//n_groups)],
    'value': range(n_rows)
})
def cartesian_product(df):
    # All ordered pairs of row indices within the group
    index_combinations = list(product(df.index, repeat=2))
    result = pd.DataFrame(index_combinations, columns=['index1', 'index2'])
    result = result.merge(df, left_on='index1', right_index=True)
    result = result.merge(df, left_on='index2', right_index=True, suffixes=('_1', '_2'))
    # Drop the self-pairs (each row matched with itself)
    result = result[result['index1'] != result['index2']]
    return result.reset_index(drop=True)
start_time = time.time()
result = df.groupby('owner').apply(cartesian_product)
elapsed_time = time.time() - start_time
print(result)
print(f"Elapsed time: {elapsed_time:.2f} seconds")
which returns
index1 index2 owner_1 value_1 owner_2 value_2
owner
group0 0 1 0 group0 1 group0 0
1 2 0 group0 2 group0 0
2 3 0 group0 3 group0 0
3 4 0 group0 4 group0 0
4 5 0 group0 5 group0 0
... ... ... ... ... ... ...
group999 9895 99994 99999 group999 99994 group999 99999
9896 99995 99999 group999 99995 group999 99999
9897 99996 99999 group999 99996 group999 99999
9898 99997 99999 group999 99997 group999 99999
9899 99998 99999 group999 99998 group999 99999
[9900000 rows x 6 columns]
Elapsed time: 13.59 seconds
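Note that groupby().apply() returns the result indexed by the group key; if you prefer a flat DataFrame like the one your loop produces, you can simply drop that index afterwards, for example:
flat = result.reset_index(drop=True)  # drop the (owner, row-number) MultiIndex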
The same thing timed with your original method (note that merge(how='cross') also keeps each row paired with itself, which cartesian_product filters out, hence 10,000,000 rows instead of 9,900,000):
start_time = time.time()
new_df = pd.DataFrame()
for o in set(df['owner']):
    tmp = df[df['owner'] == o].reset_index(drop=True)
    tmp = tmp.merge(tmp, how='cross')
    new_df = pd.concat([tmp, new_df])
elapsed_time = time.time() - start_time
print(new_df)
print(f"Elapsed time: {elapsed_time:.2f} seconds")
returns
owner_x value_x owner_y value_y
0 group628 62800 group628 62800
1 group628 62800 group628 62801
2 group628 62800 group628 62802
3 group628 62800 group628 62803
4 group628 62800 group628 62804
... ... ... ... ...
9995 group146 14699 group146 14695
9996 group146 14699 group146 14696
9997 group146 14699 group146 14697
9998 group146 14699 group146 14698
9999 group146 14699 group146 14699
[10000000 rows x 4 columns]
Elapsed time: 97.56 seconds
So, on this example the itertools-based approach is roughly seven times faster; apply it to your particular need.
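As noted above, merge(how='cross') also keeps each row paired with itself, which cartesian_product filters out. If you need those self-pairs as well, a minimal variant is the same function without the filtering line (the name cartesian_product_with_self is just an illustrative label):
def cartesian_product_with_self(df):
    # Same as cartesian_product, but keeps each row paired with itself
    index_combinations = list(product(df.index, repeat=2))
    result = pd.DataFrame(index_combinations, columns=['index1', 'index2'])
    result = result.merge(df, left_on='index1', right_index=True)
    result = result.merge(df, left_on='index2', right_index=True, suffixes=('_1', '_2'))
    return result.reset_index(drop=True)

result = df.groupby('owner').apply(cartesian_product_with_self)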