Search code examples
python-3.xpandasdataframegroup-byaggregate

[pandas]: Groupby on column(s), then define multiple (including some custom) aggregation functions simultaneously


I need to be able to define, in one groupby/agg statement, the sum & the division of two separate columns and save them as a new column, and then calculate the sum of another column.

Mods, please examine my desired output first before reviewing this question. Looking at other questions on this site, I could only figure out how to write custom aggregation functions for one aggregation column, not multiple.

For example, Python - Pandas data frame: customized aggregation function after groupy? asks a similar question, but provides no advice as to how to define custom functions for multiple columns, simultaneously.


Below is a problem statement, my attempt at a solution, and my desired result.

Here is my dataframe:

df = pd.DataFrame({'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
                   'is_orange': [1, 1, 0, 0, 1, 0, 1],
                   'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
                   'melons':     [73, 81, 94, 174, 23, 71, 65]})

I would like to do something like this:

df.groupby(['location']).agg(
    'total orange/non-orange' : df['is_orange'] + df['is_non_orange'],
    'percent_orange'          : df['is_orange'] / (df['is_orange'] + df['is_non_orange']),
    'sum_melons'              : sum(df['melons']))

With the desired output being:

df = 

location    total_orange/non-orange    percentage_oranges    melons
backyard                        3.0                  0.66       270
bank                            2.0                  0.00       165
store                           2.0                  1.00       146

Thanks in advance.


Solution

  • Another answer (inspired by discussions with Ry on the Python Discord server, and my own testing):

    Define the dataframe:

    df = pd.DataFrame({
        'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
        'is_orange': [1, 1, 0, 0, 1, 0, 1],
        'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
        'melons': [73, 81, 94, 174, 23, 71, 65]
    })
    

    We then define a function that will be performed over each of the subgroups defined by the eventual .groupby() statement.

    def stats(df_subgroup):
        return pd.Series({
        'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
        'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
        'melons': (df_subgroup['melons']).sum()
    })
    

    The output is given by this command:

    df.groupby(['location']).apply(stats)
    

    With the output being:

        total_oranges   percentage_oranges  melons
    location            
    backyard    3.0 0.666667    270.0
    bank    2.0 0.000000    165.0
    store   2.0 1.000000    146.0
    

    I have also tested this with multiple .groupby() columns, and it seems to work fine. One thing of note is the use of the .mean() function. I am sure there is a better way to extract the calculated value for each subgroup, but this is a decent workaround for now.


    Lastly, I would like to mention that this approach requires additional tweaking if the column entries are not both float/int. For example, defining the is_orange and is_non_orange columns as Booleans:

    df = pd.DataFrame({
        'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
        'is_orange': [True, True, False, False, True, False, True],
        'is_non_orange': [False, False, True, True, False, True, False],
        'melons': [73, 81, 94, 174, 23, 71, 65]
    })
    

    And attempting to run the same logic as before:

    def stats(df_subgroup):
        return pd.Series({
        'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
        'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
        'melons': (df_subgroup['melons']).sum()
    })
        
    df.groupby(['location']).apply(stats)
    

    Yields the following error: NotImplementedError: operator 'truediv' not implemented for bool dtypes.

    Instead, we modify the logic to be:

    def stats(df_subgroup):
        return pd.Series({
        'total_oranges' : (df_subgroup['is_non_orange'].sum() + df_subgroup['is_orange'].sum()).sum(),
        'percentage_oranges' : (df_subgroup['is_orange'].sum() / (df_subgroup['is_non_orange'].sum() + df_subgroup['is_orange'].sum())).mean(),
        'melons': (df_subgroup['melons']).sum()
    })
        
    df.groupby(['location']).apply(stats)
    

    (Note the use of .sum() at the end of each Boolean column call)

    This yields the desired output:

        total_oranges   percentage_oranges  melons
    location            
    backyard    3.0 0.666667    270.0
    bank    2.0 0.000000    165.0
    store   2.0 1.000000    146.0