I have code that can do this, but I am iterating through each row of the dataframe with iterrows()
. It takes quite a long time to process considering it's checking through over 6M rows. And want to use vectorisation to speed it up.
I've looked at using pd.Grouper
and freq
, but have gotten stuck on how to use the 2 dataframes to do this check with that.
Given the 2 dataframes below:
I want to look at all rows in df1
(grouped by 'sid'
and 'modtype'
):
df1:
sid servid date modtype service
0 123 881 2022-07-05 A1 z
1 456 879 2022-07-02 A2 z
Then find them in df2
and count the occurrences of those groups within 3 days of the date of that group in df1
, to get a count of how many times that group comes within 3 days before, and a count of occurrences it comes within 3 days after.
df2:
sid servid date modtype
0 123 1234 2022-07-03 A1
1 123 881 2022-07-05 A1
2 123 65781 2022-07-06 A1
3 123 8552 2022-07-30 A1
4 123 3453 2022-07-04 A2
5 123 5681 2022-07-07 A2
6 456 78 2022-07-01 A1
7 456 26744 2022-05-05 A2
8 456 56166 2022-06-29 A2
9 456 56717 2022-06-30 A2
10 456 879 2022-07-02 A2
11 456 56 2022-07-25 A2
So, essentially, in the sample set which I provide below, my output would end up with:
sid servid date modtype service cnt_3day_before cnt_3day_after
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 2 0
Sample set:
import pandas as pd
data1 = {
'sid':['123','456'],
'servid':['881','879'],
'date':['2022-07-05','2022-07-02'],
'modtype':['A1','A2'],
'service':['z','z']}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)
data2 = {
'sid':['123','123','123','123','123','123',
'456','456','456','456','456','456'],
'servid':['1234','3453','881','65781','5681','8552',
'26744','56717','879','56166','56','78'],
'date':['2022-07-03','2022-07-04','2022-07-05','2022-07-06','2022-07-07','2022-07-30',
'2022-05-05','2022-06-30','2022-07-02','2022-06-29','2022-07-25','2022-07-01'],
'modtype':['A1','A2','A1','A1','A2','A1',
'A2','A2','A2','A2','A2','A1']}
df2 = pd.DataFrame(data2)
df2['date'] = pd.to_datetime(df2['date'])
df2 = df2.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)
# Merge the dataframes on sid and modtype
keys = ['sid', 'modtype']
s = df2.merge(df1[[*keys, 'date']], on=keys, suffixes=['', '_'])
# Create boolean condtitions as per requirements
s['cnt_3day_after'] = s['date'].between(s['date_'], s['date_'] + pd.DateOffset(days=3), inclusive='right')
s['cnt_3day_before'] = s['date'].between(s['date_'] - pd.DateOffset(days=3), s['date_'], inclusive='left' )
# group the boolean conditions by sid and modtype
# and aggregate with sum to count the number of True values
s = s.groupby(keys)[['cnt_3day_after', 'cnt_3day_before']].sum()
# Join the aggregated counts back with df1
df_out = df1.join(s, on=keys)
print(df_out)
sid servid date modtype service cnt_3day_after cnt_3day_before
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 0 2