I need to be able to define, in one groupby/agg statement, the sum & the division of two separate columns and save them as a new column, and then calculate the sum of another column.
Mods, please examine my desired output first before reviewing this question. Looking at other questions on this site, I could only figure out how to write custom aggregation functions for one aggregation column, not multiple.
For example, Python - Pandas data frame: customized aggregation function after groupy? asks a similar question, but provides no advice as to how to define custom functions for multiple columns, simultaneously.
Below is a problem statement, my attempt at a solution, and my desired result.
Here is my dataframe:
df = pd.DataFrame({'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
'is_orange': [1, 1, 0, 0, 1, 0, 1],
'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
'melons': [73, 81, 94, 174, 23, 71, 65]})
I would like to do something like this:
df.groupby(['location']).agg(
'total orange/non-orange' : df['is_orange'] + df['is_non_orange'],
'percent_orange' : df['is_orange'] / (df['is_orange'] + df['is_non_orange']),
'sum_melons' : sum(df['melons']))
With the desired output being:
df =
location total_orange/non-orange percentage_oranges melons
backyard 3.0 0.66 270
bank 2.0 0.00 165
store 2.0 1.00 146
Thanks in advance.
Another answer (inspired by discussions with Ry on the Python Discord server, and my own testing):
Define the dataframe:
df = pd.DataFrame({
'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
'is_orange': [1, 1, 0, 0, 1, 0, 1],
'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
'melons': [73, 81, 94, 174, 23, 71, 65]
})
We then define a function that will be performed over each of the subgroups defined by the eventual .groupby()
statement.
def stats(df_subgroup):
return pd.Series({
'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
'melons': (df_subgroup['melons']).sum()
})
The output is given by this command:
df.groupby(['location']).apply(stats)
With the output being:
total_oranges percentage_oranges melons
location
backyard 3.0 0.666667 270.0
bank 2.0 0.000000 165.0
store 2.0 1.000000 146.0
I have also tested this with multiple .groupby()
columns, and it seems to work fine. One thing of note is the use of the .mean()
function. I am sure there is a better way to extract the calculated value for each subgroup, but this is a decent workaround for now.
Lastly, I would like to mention that this approach requires additional tweaking if the column entries are not both float/int. For example, defining the is_orange
and is_non_orange
columns as Booleans:
df = pd.DataFrame({
'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
'is_orange': [True, True, False, False, True, False, True],
'is_non_orange': [False, False, True, True, False, True, False],
'melons': [73, 81, 94, 174, 23, 71, 65]
})
And attempting to run the same logic as before:
def stats(df_subgroup):
return pd.Series({
'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
'melons': (df_subgroup['melons']).sum()
})
df.groupby(['location']).apply(stats)
Yields the following error: NotImplementedError: operator 'truediv' not implemented for bool dtypes
.
Instead, we modify the logic to be:
def stats(df_subgroup):
return pd.Series({
'total_oranges' : (df_subgroup['is_non_orange'].sum() + df_subgroup['is_orange'].sum()).sum(),
'percentage_oranges' : (df_subgroup['is_orange'].sum() / (df_subgroup['is_non_orange'].sum() + df_subgroup['is_orange'].sum())).mean(),
'melons': (df_subgroup['melons']).sum()
})
df.groupby(['location']).apply(stats)
(Note the use of .sum()
at the end of each Boolean column call)
This yields the desired output:
total_oranges percentage_oranges melons
location
backyard 3.0 0.666667 270.0
bank 2.0 0.000000 165.0
store 2.0 1.000000 146.0