python, pandas, pandas-groupby

How to speed up a conditional groupby sum in pandas


I have a dataframe with a huge number of rows, and I want to apply a conditional groupby sum to it.

This is an example of my dataframe and code:

import pandas as pd

data = {'Case': [1, 1, 1, 1, 1, 1],
        'Id': [1, 1, 1, 1, 2, 2],
        'Date1': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01', '2020-01-01', '2020-01-01'],
        'Date2': ['2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01'],
        'Quantity': [50,100,150,20,30,35]
        }

df = pd.DataFrame(data)

df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])

sum_list = []

# For each unique Date1 value d, sum Quantity per (Case, Id) over rows
# whose Date2 equals d and whose Date1 comes before d.
for d in df['Date1'].unique():
    temp = df.groupby(['Case', 'Id']).apply(lambda x: x[(x['Date2'] == d) & (x['Date1'] < d)]['Quantity'].sum()).rename('sum').to_frame()
    temp['Date'] = d
    sum_list.append(temp)

output = pd.concat(sum_list, axis=0).reset_index()

When I apply this for loop to my real dataframe, it's extremely slow, and I want to find a better way to do this conditional groupby sum. Here are my questions:

  1. Is a for loop a good way to do what I need here?
  2. Are there better ways to replace the first line inside the for loop (the groupby/apply)?
  3. The second line inside the for loop (temp['Date'] = d) also feels time-consuming; how should I improve it?

Thanks for your help.


Solution

  • One option is a double merge and a groupby:

    # Candidate dates are Date1's unique values.
    date = pd.Series(df.Date1.unique(), name='Date')

    # Step 1: attach every candidate date to the rows whose Date2 matches it.
    step1 = df.merge(date, left_on='Date2', right_on='Date', how='outer')

    # Step 2: keep rows where Date1 precedes the date, then sum per group.
    step2 = step1.loc[step1.Date1 < step1.Date]
    step2 = step2.groupby(['Case', 'Id', 'Date']).agg(sum=('Quantity', 'sum'))

    # Rebuild the (Case, Id, Date) scaffold and attach the sums,
    # filling groups that had no qualifying rows with 0.
    (df
    .loc[:, ['Case', 'Id', 'Date2']]
    .drop_duplicates()
    .rename(columns={'Date2': 'Date'})
    .merge(step2, how='left', on=['Case', 'Id', 'Date'])
    .fillna({'sum': 0}, downcast='infer')
    )
    
       Case  Id       Date  sum
    0     1   1 2020-01-01    0
    1     1   1 2020-02-01  100
    2     1   2 2020-01-01    0
    3     1   2 2020-02-01   35
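
  • The speedup comes from replacing the Python-level loop and groupby.apply with vectorized merges. If every Date2 value of interest also occurs in Date1 (true for the sample data; verify it on the real one), the condition can collapse further into a single filter and groupby. A minimal sketch under that assumption, reusing df from the question:

    # Assumption: every Date2 that should appear in the output also occurs
    # in Date1, so (Date2 == d) & (Date1 < d) with d in Date1.unique()
    # collapses to Date1 < Date2.
    eligible = df.loc[df['Date1'] < df['Date2']]

    sums = (eligible
            .groupby(['Case', 'Id', 'Date2'], as_index=False)['Quantity']
            .sum()
            .rename(columns={'Date2': 'Date', 'Quantity': 'sum'}))

    # Scaffold of the (Case, Id, Date) rows the output should contain,
    # left-merged so groups with no qualifying rows get 0.
    scaffold = (df[['Case', 'Id', 'Date2']]
                .drop_duplicates()
                .rename(columns={'Date2': 'Date'}))

    alt = scaffold.merge(sums, how='left', on=['Case', 'Id', 'Date'])
    alt['sum'] = alt['sum'].fillna(0).astype(int)

  On the sample data this produces the same four rows as the double-merge version; it is a sketch, not a drop-in replacement, until the assumption about Date2 is checked.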