
For each day get the sum of all rows in a very large Pandas DataFrame which match in two specific columns


I have a very large Pandas DataFrame with 28,171,643 rows and 4 columns. A subset of this DataFrame for one day is shown below.

My task is now to calculate the total amount per pair for each day. The days range from 90 to 320. Please note that (as the name says) the reversed column always contains the flipped tuple from the pair column in each row.

Here is an example:

Rows 0, 1, 2 and 4 all contain the same tuple combination, whether it is in column pair or reversed, and therefore need to be summed up: 5 + 17 + 1604 + 1558 = 3184. Ideally this information is then stored in a new DataFrame with the columns day, amount and tuple. It does not matter whether tuple holds the value from pair or reversed, as the combination is undirected.
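To make this concrete, here is a tiny illustrative frame with the same column layout (the values are made up, not taken from the real data) reproducing the sum above:

```python
import pandas as pd

# illustrative data only; the column layout matches the real frame
df = pd.DataFrame({
    "day":      [90, 90, 90, 90, 90],
    "pair":     [("a", "b"), ("b", "a"), ("a", "b"), ("c", "d"), ("b", "a")],
    "reversed": [("b", "a"), ("a", "b"), ("b", "a"), ("d", "c"), ("a", "b")],
    "amount":   [5, 17, 1604, 2, 1558],
})

# rows 0, 1, 2 and 4 all describe the undirected pair {"a", "b"}
total = sum(a for p, a in zip(df["pair"], df["amount"]) if set(p) == {"a", "b"})
# total == 5 + 17 + 1604 + 1558 == 3184
```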

I have a solution, shown below, but it takes far too long for this huge dataset! My hardware is a workstation with 48 cores, 186 GB RAM and a Quadro RTX 8000 GPU. An easy solution using, for example, Dask or rapids.ai would be totally fine!

My slow approach, which works day by day (if there is a way to parallelize it, that would also help!):

def analysis(d, t):
    # restrict to a single day; copy to avoid SettingWithCopyWarning
    combinations_df = d.loc[d['day'] == t].copy()

    # for every row, find the index of the row holding the flipped tuple
    index = []
    for idx, row in combinations_df.iterrows():
        idd = combinations_df[combinations_df['reversed'] == row['pair']].index
        if len(idd) != 0:
            index.append(idd[0])
        else:
            index.append(-1)
    combinations_df['reversed_idx'] = index

    skippy = []   # rows whose amount has already been merged
    to_drop = []  # mirrored rows to remove afterwards

    def add_occurences(row):
        if row['reversed_idx'] == -1 or row['reversed_idx'] in skippy:
            return row
        # add the mirrored row's amount and mark that row for removal
        row['amount'] += combinations_df.loc[row['reversed_idx']]['amount']
        skippy.append(row.name)
        to_drop.append(row['reversed_idx'])
        return row

    res = combinations_df.apply(add_occurences, axis=1)
    to_drop = list(set(to_drop))
    return res.drop(to_drop)[['day', 'amount', 'pair']]
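On the parallelization question: one sketch (not the original code; it replaces the row-wise matching with a vectorised sorted-tuple key, and the data below is hypothetical) is to run the per-day aggregation across a pool of workers. Threads are used here for simplicity; for genuinely CPU-bound gains you would swap in `ProcessPoolExecutor` or Dask:

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# toy stand-in for the full 28M-row frame; the data is hypothetical
df = pd.DataFrame({
    "day":    [90, 90, 90, 91, 91],
    "pair":   [("a", "b"), ("b", "a"), ("a", "b"), ("c", "d"), ("d", "c")],
    "amount": [5, 17, 1604, 2, 3],
})

def analyse_day(day):
    # vectorised per-day aggregation: normalise each tuple to sorted order,
    # then sum the amounts per unique undirected pair
    sub = df[df["day"] == day]
    key = sub["pair"].map(lambda t: tuple(sorted(t)))
    return sub.groupby(["day", key])["amount"].sum().reset_index()

# one task per day; threads avoid pickling the large frame
with ThreadPoolExecutor() as ex:
    parts = list(ex.map(analyse_day, df["day"].unique()))
result = pd.concat(parts, ignore_index=True)
```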



Solution

  • Similarly to the previous answer, use groupby and agg, but compute the sum on a unique key combination:

    result = my_df.groupby(['day', my_df.pair.apply(set).apply(tuple)])[['amount']].agg('sum').reset_index()
    

    With a random DataFrame of length 5000, looping over the days with your function took 4.38 s ± 204 ms for me; with this approach I'm at 9.86 ms ± 185 µs.
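One caveat: the `set`-then-`tuple` key relies on set iteration order being consistent for equal sets. A variant (my adaptation, not the answerer's code; `my_df` below is a hypothetical stand-in) that normalises each pair with `sorted` makes the grouping key deterministic:

```python
import pandas as pd

# hypothetical stand-in for my_df
my_df = pd.DataFrame({
    "day":    [90, 90, 90],
    "pair":   [("a", "b"), ("b", "a"), ("c", "d")],
    "amount": [5, 17, 2],
})

# normalise each tuple so that ("b", "a") and ("a", "b") share one key
key = my_df["pair"].apply(lambda t: tuple(sorted(t)))
result = my_df.groupby(["day", key])[["amount"]].agg("sum").reset_index()
```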