I am trying to do the next example. I want to get the total number of survivors (survived) and the total amount paid per class (fared) using the Titanic dataset.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bvalgard/data/main/titanic.csv')
df.groupby(['pclass'])[['survived', 'fare']].sum()
When I run this code, I only get the total number of survivors, but not the total amount paid. However, if I use other functions such as .min()
, .max()
etc it works.
What is the problem with the .sum()
function then?
This is because you don't have numerical values but objects (df['fare'].dtype
gives dtype('O')
, df['survived'].dtype
gives dtype('int64')
), and groupby.sum
silently drops invalid columns. You must convert to_numeric
:
(df.assign(fare=pd.to_numeric(df['fare'], errors='coerce'))
.groupby(['pclass'])[['survived', 'fare']].sum()
)
Output:
survived fare
pclass
1 200 28265.4043
2 119 5866.6374
3 181 9418.4452
groupby.sum
only works with numeric data by default (numeric_only=True
), while groupby.max
has a default of False
as it's more flexible to handle a wider range of data types (you can for instance compare strings, max(['A', 'B'])
give 'B'
).
numeric_only=False
?Yes, you can, but this would concatenate the strings rather than adding numerical values, which is not what you want:
df.groupby(['pclass'])[['survived', 'fare']].sum(numeric_only=False)
survived fare
pclass
1 200 211.3375151.55151.55151.55151.5526.5577.958305...
2 119 24241311.510.526261311.510.51310.512.525262626...
3 181 7.5520.2520.2520.257.657.657.9257.22927.258.05...