Search code examples
pythonpandasgroup-by

groupby and aggregate on the resulting groups


I have a dataframe with a column Status which values are either 'OPEN' or 'CLOSED'. I would like to do a groupby() on multiple columns and use the following rule for that Status column: if there is one or more 'OPEN' values in the group then the aggregate should return 'OPEN' else 'CLOSED'

I tried the following:

df_agg = df.groupby(['col1', 'col2', 'col3'], as_index=False)
           .agg({'col4': 'sum', 'Status': lambda x: np.where(x == 'OPEN', 'OPEN', 'CLOSED')}).reset_index(drop=True)

but this returns lists like ['OPEN', 'CLOSED'], ['OPEN', 'CLOSED', 'CLOSED'] and so on. Is there a better way using the agg() function to return a single value of 'OPEN' or 'CLOSED' rather than doing the groupby() and then doing again something like below?

np.where(df_agg['Status'].str.contains('OPEN'), 'OPEN', 'CLOSED')

Solution

  • Yes, it is possible with any test if at least one match OPEN in lambda function:

    #if need test substring OPEN
    lambda x: 'OPEN' if x.str.contains('OPEN').any() else 'CLOSED'
    
    #if need test string OPEN
    lambda x: 'OPEN' if x.eq('OPEN').any() else 'CLOSED'
    

    Better is test before groupby and because OPEN > CLOSED alpha numerically aggregate max:

    #if need test substring OPEN
    df_agg = (df.assign(Status = np.where(df['Status'].str.contains('OPEN'), 
                                          'OPEN', 'CLOSED'))
              .groupby(['col1', 'col2', 'col3'], as_index=False)
               .agg({'col4': 'sum', 'Status': 'max'}))
    
    #if need test string OPEN
    df_agg = (df.assign(Status = np.where(df['Status'].eq('OPEN'), 
                                          'OPEN', 'CLOSED'))
              .groupby(['col1', 'col2', 'col3'], as_index=False)
               .agg({'col4': 'sum', 'Status': 'max'}))
    

    Or:

    #if need test substring OPEN
    df_agg = (df.assign(Status = df['Status'].str.contains('OPEN'))
               .groupby(['col1', 'col2', 'col3'], as_index=False)
               .agg({'col4': 'sum', 'Status': 'any'})
               .assign(Status = lambda x: x['Status'].map({True:'OPEN',False:'CLOSED'}))
    
    #if need test string OPEN
    df_agg = (df.assign(Status = df['Status'].eq('OPEN'))
               .groupby(['col1', 'col2', 'col3'], as_index=False)
               .agg({'col4': 'sum', 'Status': 'any'})
               .assign(Status = lambda x: x['Status'].map({True:'OPEN',False:'CLOSED'}))