I am trying to understand the panda group by function
import pandas as pd
data=[['a',3],['a',3],['b',1],['a',0],['b',0]]
df=pd.DataFrame(data,columns=['Room','Value'])
print(df)
sum_df=df.groupby(['Room']).agg(
sumValue =('Value','sum'),
nonBlankOccasion =('Value', lambda x: x>0).count())
).reset_index()
print(sum_df)
I'm trying to get a result of room, sum(value)
and number occasion on non-zero value.
but I can't get the correct answer by using the code above. it gives me the count on all records by room
Room sumValue nonBlankOccation
0 a 6 3
1 b 1 2
However, if I changed to
sum_df=df.groupby(['Room']).agg(
sumValue =('Value','sum'),
nonBlankOccasion =('Value', lambda x:(x>0).sum())
).reset_index()
it gave the correct result.
Room sumValue nonBlankOccation
0 a 6 2
1 b 1 1
Can anyone help me understand why it is sum()
not count()
here?
Thank you very much!
Elac
I tried both sum()
and count()
in the group by function. I was expecting count()
to give me the correct answer, however sum()
is the correct one. I'd like to understand why.
If you print out what is actually an argument of the lambda function, you will see that it's an object of pandas Series. Type is printed out for each group, therefore twice:
sum_df=df.groupby(['Room']).agg(
nonBlankOccasion =('Value', lambda x: print(type(x>0)))
).reset_index()
# > <class 'pandas.core.series.Series'>
# > <class 'pandas.core.series.Series'>
You can then print out what exactly are you counting / summing:
sum_df=df.groupby(['Room']).agg(
nonBlankOccasion =('Value', lambda x: print(x>0))
).reset_index()
# > 0 True
# > 1 True
# > 3 False
# > Name: Value, dtype: bool
# > 2 True
# > 4 False
# > Name: Value, dtype: bool
Both of them are pandas Series with True / False values, both of group length.
So when applying count
you get the length of the series which is the size of the group.
When summing you get correct results, because it's the same as the number of True values in the series, which you're looking for.