Search code examples
pythongroup-byaggregate-functionspandas

Apply multiple functions to multiple groupby columns


The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]: 
      result2   result1
A                      
bar -0.579846 -1.739537
foo -0.280588 -1.402938

However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.

What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly.

For example, I've tried something like

grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std': lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum()},
             'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)

but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).

Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?


Solution

  • The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

    If you desire to work with two separate columns at the same time I would suggest using the apply method which implicitly passes a DataFrame to the applied function. Let's use a similar dataframe as the one from above

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    df
    
              a         b         c         d  group
    0  0.418500  0.030955  0.874869  0.145641      0
    1  0.446069  0.901153  0.095052  0.487040      0
    2  0.843026  0.936169  0.926090  0.041722      1
    3  0.635846  0.439175  0.828787  0.714123      1
    

    A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

    df.groupby('group').agg({'a':['sum', 'max'], 
                             'b':'mean', 
                             'c':'sum', 
                             'd': lambda x: x.max() - x.min()})
    
                  a                   b         c         d
                sum       max      mean       sum  <lambda>
    group                                                  
    0      0.864569  0.446069  0.466054  0.969921  0.341399
    1      1.478872  0.843026  0.687672  1.754877  0.672401
    

    If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

    def max_min(x):
        return x.max() - x.min()
    
    max_min.__name__ = 'Max minus Min'
    
    df.groupby('group').agg({'a':['sum', 'max'], 
                             'b':'mean', 
                             'c':'sum', 
                             'd': max_min})
    
                  a                   b         c             d
                sum       max      mean       sum Max minus Min
    group                                                      
    0      0.864569  0.446069  0.466054  0.969921      0.341399
    1      1.478872  0.843026  0.687672  1.754877      0.672401
    

    Using apply and returning a Series

    Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply the entire group as a DataFrame gets passed into the function.

    I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

    def f(x):
        d = {}
        d['a_sum'] = x['a'].sum()
        d['a_max'] = x['a'].max()
        d['b_mean'] = x['b'].mean()
        d['c_d_prodsum'] = (x['c'] * x['d']).sum()
        return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
    
    df.groupby('group').apply(f)
    
             a_sum     a_max    b_mean  c_d_prodsum
    group                                           
    0      0.864569  0.446069  0.466054     0.173711
    1      1.478872  0.843026  0.687672     0.630494
    

    If you are in love with MultiIndexes, you can still return a Series with one like this:

        def f_mi(x):
            d = []
            d.append(x['a'].sum())
            d.append(x['a'].max())
            d.append(x['b'].mean())
            d.append((x['c'] * x['d']).sum())
            return pd.Series(d, index=[['a', 'a', 'b', 'c_d'], 
                                       ['sum', 'max', 'mean', 'prodsum']])
    
    df.groupby('group').apply(f_mi)
    
                  a                   b       c_d
                sum       max      mean   prodsum
    group                                        
    0      0.864569  0.446069  0.466054  0.173711
    1      1.478872  0.843026  0.687672  0.630494