Search code examples
pythonpandasgroup-bycountvectorization

Efficiently count occurrences of a value with groupby and within a date range


I have code that can do this, but I am iterating through each row of the dataframe with iterrows(). It takes quite a long time to process considering it's checking through over 6M rows. And want to use vectorisation to speed it up.

I've looked at using pd.Grouper and freq, but have gotten stuck on how to use the 2 dataframes to do this check with that.

Given the 2 dataframes below:

I want to look at all rows in df1 (grouped by 'sid' and 'modtype'):

df1:

   sid servid       date modtype service
0  123    881 2022-07-05      A1       z
1  456    879 2022-07-02      A2       z

Then find them in df2 and count the occurrences of those groups within 3 days of the date of that group in df1, to get a count of how many times that group comes within 3 days before, and a count of occurrences it comes within 3 days after.

df2:

    sid servid       date modtype
0   123   1234 2022-07-03      A1
1   123    881 2022-07-05      A1
2   123  65781 2022-07-06      A1
3   123   8552 2022-07-30      A1
4   123   3453 2022-07-04      A2
5   123   5681 2022-07-07      A2
6   456     78 2022-07-01      A1
7   456  26744 2022-05-05      A2
8   456  56166 2022-06-29      A2
9   456  56717 2022-06-30      A2
10  456    879 2022-07-02      A2
11  456     56 2022-07-25      A2

So, essentially, in the sample set which I provide below, my output would end up with:

   sid servid       date modtype service  cnt_3day_before   cnt_3day_after
0  123    881 2022-07-05      A1       z    1                 1
1  456    879 2022-07-02      A2       z    2                 0

Sample set:

import pandas as pd

data1 = {
    'sid':['123','456'],
    'servid':['881','879'],
    'date':['2022-07-05','2022-07-02'],
    'modtype':['A1','A2'],
    'service':['z','z']}

df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)



data2 = {
        'sid':['123','123','123','123','123','123',
               '456','456','456','456','456','456'],
        'servid':['1234','3453','881','65781','5681','8552',
                  '26744','56717','879','56166','56','78'],
        'date':['2022-07-03','2022-07-04','2022-07-05','2022-07-06','2022-07-07','2022-07-30',
                '2022-05-05','2022-06-30','2022-07-02','2022-06-29','2022-07-25','2022-07-01'],
        'modtype':['A1','A2','A1','A1','A2','A1',
                   'A2','A2','A2','A2','A2','A1']}

df2 = pd.DataFrame(data2)
df2['date'] = pd.to_datetime(df2['date'])
df2 = df2.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)

Solution

  • Annotated code

    # Merge the dataframes on sid and modtype
    keys = ['sid', 'modtype']
    s = df2.merge(df1[[*keys, 'date']], on=keys, suffixes=['', '_'])
    
    # Create boolean condtitions as per requirements
    s['cnt_3day_after']  = s['date'].between(s['date_'], s['date_'] + pd.DateOffset(days=3), inclusive='right')
    s['cnt_3day_before'] = s['date'].between(s['date_'] - pd.DateOffset(days=3), s['date_'], inclusive='left' )
    
    # group the boolean conditions by sid and modtype
    # and aggregate with sum to count the number of True values
    s = s.groupby(keys)[['cnt_3day_after', 'cnt_3day_before']].sum()
    
    # Join the aggregated counts back with df1
    df_out = df1.join(s, on=keys)
    

    Result

    print(df_out)
    
       sid servid       date modtype service  cnt_3day_after  cnt_3day_before
    0  123    881 2022-07-05      A1       z               1                1
    1  456    879 2022-07-02      A2       z               0                2