I have the following two dataframes
import pandas as pd
dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
't': dates*4,
'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})
df2 = df1.loc[df1['stage'] == 1]
What is the most efficient way of doing the operation below in python?
Select a.id
,a.t
,max(b.stage = 2) as flag
From df2 as a
Left join df1 as b
On a.id = b.id and a.t < b.t
Group by 1,2
This should help you out:
import pandas as pd
# Your data
dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
't': dates*4,
'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})
# Filter stage 1
df2 = df1.loc[df1['stage'] == 1]
# Merge df2 and df1 on 'id' with a left join
merged_df = df2.merge(df1, on='id', how='left', suffixes=('', '_b'))
# Filter rows where a.t < b.t
filtered_df = merged_df[merged_df['t'] < merged_df['t_b']]
# Group by 'id' and 't', and create a flag for max(b.stage = 2)
result = filtered_df.groupby(['id', 't']).agg(flag=('stage_b', lambda x: (x == 2).max())).reset_index()
print(result)