python-3.x pandas dataframe group-by aggregate

[pandas]: Groupby on column(s), then define multiple (including some custom) aggregation functions simultaneously

I need to be able to define, in one groupby/agg statement, the sum & the division of two separate columns and save them as a new column, and then calculate the sum of another column.

Mods, please examine my desired output first before reviewing this question. Looking at other questions on this site, I could only figure out how to write custom aggregation functions for one aggregation column, not multiple.

For example, Python - Pandas data frame: customized aggregation function after groupy? asks a similar question, but provides no advice as to how to define custom functions for multiple columns, simultaneously.

Below is a problem statement, my attempt at a solution, and my desired result.

Here is my dataframe:

df = pd.DataFrame({'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
                   'is_orange': [1, 1, 0, 0, 1, 0, 1],
                   'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
                   'melons':     [73, 81, 94, 174, 23, 71, 65]})

I would like to do something like this:

df.groupby(['location']).agg(
    'total orange/non-orange' : df['is_orange'] + df['is_non_orange'],
    'percent_orange'          : df['is_orange'] / (df['is_orange'] + df['is_non_orange']),
    'sum_melons'              : sum(df['melons']))

With the desired output being:

df = 

location    total_orange/non-orange    percentage_oranges    melons
backyard                        3.0                  0.66       270
bank                            2.0                  0.00       165
store                           2.0                  1.00       146

Thanks in advance.

Solution

Another answer (inspired by discussions with Ry on the Python Discord server, and my own testing):

Define the dataframe:

df = pd.DataFrame({
    'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
    'is_orange': [1, 1, 0, 0, 1, 0, 1],
    'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
    'melons': [73, 81, 94, 174, 23, 71, 65]
})

We then define a function that will be performed over each of the subgroups defined by the eventual .groupby() statement.

def stats(df_subgroup):
    return pd.Series({
    'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
    'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
    'melons': (df_subgroup['melons']).sum()
})

The output is given by this command:

df.groupby(['location']).apply(stats)

With the output being:

    total_oranges   percentage_oranges  melons
location            
backyard    3.0 0.666667    270.0
bank    2.0 0.000000    165.0
store   2.0 1.000000    146.0

I have also tested this with multiple .groupby() columns, and it seems to work fine. One thing of note is the use of the .mean() function. I am sure there is a better way to extract the calculated value for each subgroup, but this is a decent workaround for now.

Lastly, I would like to mention that this approach requires additional tweaking if the column entries are not both float/int. For example, defining the is_orange and is_non_orange columns as Booleans:

df = pd.DataFrame({
    'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
    'is_orange': [True, True, False, False, True, False, True],
    'is_non_orange': [False, False, True, True, False, True, False],
    'melons': [73, 81, 94, 174, 23, 71, 65]
})

And attempting to run the same logic as before:

def stats(df_subgroup):
    return pd.Series({
    'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
    'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
    'melons': (df_subgroup['melons']).sum()
})
    
df.groupby(['location']).apply(stats)

Yields the following error: NotImplementedError: operator 'truediv' not implemented for bool dtypes.

Instead, we modify the logic to be:

def stats(df_subgroup):
    return pd.Series({
    'total_oranges' : (df_subgroup['is_non_orange'].sum() + df_subgroup['is_orange'].sum()).sum(),
    'percentage_oranges' : (df_subgroup['is_orange'].sum() / (df_subgroup['is_non_orange'].sum() + df_subgroup['is_orange'].sum())).mean(),
    'melons': (df_subgroup['melons']).sum()
})
    
df.groupby(['location']).apply(stats)

(Note the use of .sum() at the end of each Boolean column call)

This yields the desired output:

    total_oranges   percentage_oranges  melons
location            
backyard    3.0 0.666667    270.0
bank    2.0 0.000000    165.0
store   2.0 1.000000    146.0