Search code examples
pythonpandasdataframegroup-bysummary

Aggregate multiple columns in a dataframe based on custom functions


Afternoon All,

I have been trying to resolve this for awhile, any help would be appreciated.

Here is my dataframe:

Channel state       rfq_qty
A        Done       10
B        Tied Done  10
C        Done       10
C        Done       10
C        Done       10
C        Tied Done  10
B        Done       10
B        Done       10

I would like to:

  1. Group by channel, then state
  2. Sum the rfq_qty for each channel
  3. Count the occurences of each 'done' string in state ('Done' is treated the same as 'Tied Done' i.e. anything with 'done' in it)
  4. Display the channels rfq_qty as a percentage of the total number of rfq_qty (80)
Channel state   rfq_qty Percentage
A         1       10    0.125
B         3       30    0.375
C         4       40    0.5

I have attempted this with the following:

df_Done = df[
                (
                    df['state']=='Done'
                ) 
                | 
                (
                    df['state'] == 'Tied Done'
                )
            ][['Channel','state','rfq_qty']]

df_Done['Percentage_Qty']= df_Done['rfq_qty']/df_Done['rfq_qty'].sum()
df_Done['Done_Trades']= df_Done['state'].count()

display(
        df_Done[
                (df_Done['Channel'] != 0)
               ].groupby(['Channel'])['Channel','Count of Done','rfq_qty','Percentage_Qty'].sum().sort_values(['rfq_qty'], ascending=False)
       )

Works but looks convoluted. Any improvements?


Solution

  • I think you can use:

    • first filter by isin and loc
    • groupby and aggregate by agg with tuples of new columns names and functions
    • add Percentage by divide by div and sum
    • last if necessary sort_values by rfq_qty

    df_Done = df.loc[df['state'].isin(['Done', 'Tied Done']), ['Channel','state','rfq_qty']]
    
    #if want filter all values contains Done
    #df_Done = df[df['state'].str.contains('Done')]
    
    #if necessary filter out Channel == 0
    #mask = (df['Channel'] != 0) & df['state'].isin(['Done', 'Tied Done'])
    #df_Done = df.loc[mask, ['Channel','state','rfq_qty']]
    
    d = {('rfq_qty', 'sum'), ('Done_Trades','size')}
    df = df_Done.groupby('Channel')['rfq_qty'].agg(d).reset_index()
    df['Percentage'] = df['rfq_qty'].div(df['rfq_qty'].sum())
    df = df.sort_values('rfq_qty')
    print (df)
      Channel  Done_Trades  rfq_qty  Percentage
    0       A            1       10       0.125
    1       B            3       30       0.375
    2       C            4       40       0.500