Search code examples
pythonpandasgroup-by

panda groupby with lambda


I am trying to understand the panda group by function

import pandas as pd
data=[['a',3],['a',3],['b',1],['a',0],['b',0]]
df=pd.DataFrame(data,columns=['Room','Value'])
print(df)

sum_df=df.groupby(['Room']).agg(
 sumValue =('Value','sum'), 
 nonBlankOccasion =('Value', lambda x: x>0).count())
).reset_index()
print(sum_df)

I'm trying to get a result of room, sum(value) and number occasion on non-zero value. but I can't get the correct answer by using the code above. it gives me the count on all records by room

 Room  sumValue  nonBlankOccation
0    a         6                 3
1    b         1                 2

However, if I changed to


sum_df=df.groupby(['Room']).agg(
 sumValue =('Value','sum'),
 nonBlankOccasion =('Value', lambda x:(x>0).sum())
).reset_index()

it gave the correct result.

 Room  sumValue  nonBlankOccation
0    a         6                 2
1    b         1                 1

Can anyone help me understand why it is sum() not count() here? Thank you very much!

Elac

I tried both sum() and count() in the group by function. I was expecting count() to give me the correct answer, however sum() is the correct one. I'd like to understand why.


Solution

  • If you print out what is actually an argument of the lambda function, you will see that it's an object of pandas Series. Type is printed out for each group, therefore twice:

    sum_df=df.groupby(['Room']).agg(
        nonBlankOccasion =('Value', lambda x: print(type(x>0)))
    ).reset_index()
    
    # > <class 'pandas.core.series.Series'>
    # > <class 'pandas.core.series.Series'>
    

    You can then print out what exactly are you counting / summing:

    sum_df=df.groupby(['Room']).agg(
        nonBlankOccasion =('Value', lambda x: print(x>0))
    ).reset_index()
    
    # > 0     True
    # > 1     True
    # > 3    False
    # > Name: Value, dtype: bool
    # > 2     True
    # > 4    False
    # > Name: Value, dtype: bool
    

    Both of them are pandas Series with True / False values, both of group length.

    So when applying count you get the length of the series which is the size of the group. When summing you get correct results, because it's the same as the number of True values in the series, which you're looking for.