pd.groupby in two columns sum() function does not work

I am trying to do the next example. I want to get the total number of survivors (survived) and the total amount paid per class (fared) using the Titanic dataset.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bvalgard/data/main/titanic.csv')

df.groupby(['pclass'])[['survived', 'fare']].sum()

When I run this code, I only get the total number of survivors, but not the total amount paid. However, if I use other functions such as .min(), .max() etc it works.

What is the problem with the .sum() function then?

Solution

This is because you don't have numerical values but objects (df['fare'].dtype gives dtype('O'), df['survived'].dtype gives dtype('int64')), and groupby.sum silently drops invalid columns. You must convert to_numeric:

(df.assign(fare=pd.to_numeric(df['fare'], errors='coerce'))
   .groupby(['pclass'])[['survived', 'fare']].sum()
)

Output:

        survived        fare
pclass                      
1            200  28265.4043
2            119   5866.6374
3            181   9418.4452

Why does it fail?

groupby.sum only works with numeric data by default (numeric_only=True), while groupby.max has a default of False as it's more flexible to handle a wider range of data types (you can for instance compare strings, max(['A', 'B']) give 'B').

Can't I use `numeric_only=False`?

Yes, you can, but this would concatenate the strings rather than adding numerical values, which is not what you want:

df.groupby(['pclass'])[['survived', 'fare']].sum(numeric_only=False)

        survived                                               fare
pclass                                                             
1            200  211.3375151.55151.55151.55151.5526.5577.958305...
2            119  24241311.510.526261311.510.51310.512.525262626...
3            181  7.5520.2520.2520.257.657.657.9257.22927.258.05...

pd.groupby in two columns sum() function does not work

Why does it fail?

Can't I use numeric_only=False?

Can't I use `numeric_only=False`?