Search code examples
pythongroup-byleft-join

Python inequality join with group by


I have the following two dataframes

import pandas as pd

dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
                    't': dates*4,
                    'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})

df2 = df1.loc[df1['stage'] == 1]

What is the most efficient way of doing the operation below in python?

Select a.id
      ,a.t
      ,max(b.stage = 2) as flag 
From df2 as a
Left join df1 as b
  On a.id = b.id and a.t < b.t
Group by 1,2

Solution

  • This should help you out:

    import pandas as pd
    
    # Your data
    
    dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
    df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
                        't': dates*4,
                        'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})
    
    # Filter stage 1
    
    df2 = df1.loc[df1['stage'] == 1]
    
    # Merge df2 and df1 on 'id' with a left join
    
    merged_df = df2.merge(df1, on='id', how='left', suffixes=('', '_b'))
    
    # Filter rows where a.t < b.t
    
    filtered_df = merged_df[merged_df['t'] < merged_df['t_b']]
    
    # Group by 'id' and 't', and create a flag for max(b.stage = 2)
    
    result = filtered_df.groupby(['id', 't']).agg(flag=('stage_b', lambda x: (x == 2).max())).reset_index()
    
    print(result)