I have a very large Pandas DataFrame with 28171643 rows and 4 columns. A subset of this DataFrame for one day is shown below.
My task is now to calculate the total amount for pair
for each day. The days range from 90 to 320. Please note that (as the name says) the reversed
column always contains the flipped tuple from column pair
in each row.
Here an example:
Row 0, 1, 2 and 4 all contain the same tuple combination, wether it is in column pair
or reversed
and therefore need to be sum up to 5+17+1604+1558 = 3184. Ideally this information is than stored in a new DataFrame with columns day
, amount
and tuple
. It does not matter if tuple
contains the value from pair
or reversed
as the combination is not directed.
I have a solution show below, but this takes way too long for this huge dataset! As hardware I have a workstation with 48 cores 186GB RAM and a Quadro RTX 8000 GPU. If there is an easy solution with Dask for example or rapids.ai this is totally fine!
My slow approach on a day-basis:
If there is a way to parallelize this, it would also help!
def analysis(d, t):
combinations_df = d.loc[d['day'] == t]
index = []
for idx, row in combinations_df.iterrows():
idd = combinations_df[combinations_df['reversed'] == row['pair']].index
if len(idd) != 0:
index.append(idd[0])
else:
index.append(-1)
combinations_df['reversed_idx'] = index
skippy = []
to_drop = []
def add_occurences(row):
if row['reversed_idx'] == -1 or row['reversed_idx'] in skippy:
return row
else:
row['amount'] += combinations_df.loc[row['reversed_idx']]['amount']
skippy.append(row.name)
to_drop.append(row['reversed_idx'])
return row
res = combinations_df.apply(lambda x: add_occurences(x), axis=1)
skippy = set(skippy)
to_drop = list(set(to_drop))
return res.drop(to_drop)[['day', 'amount', 'pair']]
Similarly as previous response with using groupby and agg but make the sum on unique key combination:
result = my_df.groupby(['day', my_df.pair.apply(set).apply(tuple)])[['amount']].agg('sum').reset_index()
With a random 5000 length DataFrame, making a loop on days with your function take for me 4.38 s ± 204 ms and now, I'm at 9.86 ms ± 185 µs