Search code examples
pythonpandasdataframegroup-by

pd.groupby in two columns sum() function does not work


I am trying to do the next example. I want to get the total number of survivors (survived) and the total amount paid per class (fared) using the Titanic dataset.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bvalgard/data/main/titanic.csv')

df.groupby(['pclass'])[['survived', 'fare']].sum()

When I run this code, I only get the total number of survivors, but not the total amount paid. However, if I use other functions such as .min(), .max() etc it works.

What is the problem with the .sum() function then?


Solution

  • This is because you don't have numerical values but objects (df['fare'].dtype gives dtype('O'), df['survived'].dtype gives dtype('int64')), and groupby.sum silently drops invalid columns. You must convert to_numeric:

    (df.assign(fare=pd.to_numeric(df['fare'], errors='coerce'))
       .groupby(['pclass'])[['survived', 'fare']].sum()
    )
    

    Output:

            survived        fare
    pclass                      
    1            200  28265.4043
    2            119   5866.6374
    3            181   9418.4452
    

    Why does it fail?

    groupby.sum only works with numeric data by default (numeric_only=True), while groupby.max has a default of False as it's more flexible to handle a wider range of data types (you can for instance compare strings, max(['A', 'B']) give 'B').

    Can't I use numeric_only=False?

    Yes, you can, but this would concatenate the strings rather than adding numerical values, which is not what you want:

    df.groupby(['pclass'])[['survived', 'fare']].sum(numeric_only=False)
    
            survived                                               fare
    pclass                                                             
    1            200  211.3375151.55151.55151.55151.5526.5577.958305...
    2            119  24241311.510.526261311.510.51310.512.525262626...
    3            181  7.5520.2520.2520.257.657.657.9257.22927.258.05...