Search code examples
pythonpandasdataframegroup-by

Groupby count only when a certain value is present in one of the column in pandas


I have a dataframe similar to the below mentioned database:

+------------+-----+--------+
|    time    | id  | status |
+------------+-----+--------+
| 1451606400 | id1 | Yes    |
| 1451606400 | id1 | Yes    |
| 1456790400 | id2 | No     |
| 1456790400 | id2 | Yes    |
| 1456790400 | id2 | No     |
+------------+-----+--------+

I'm grouping by all the columns mentioned above and i'm able to get the count in a different column named 'count' successfully using the below command:

df.groupby(['time','id', 'status']).size().reset_index(name='count')

But I want the count in the above dataframe only in those rows with status = 'Yes' and rest should be '0'

Desired Output:

+------------+-----+--------+---------+
|    time    | id  | status | count   |
+------------+-----+--------+---------+
| 1451606400 | id1 | Yes    |       2 |
| 1456790400 | id2 | Yes    |       1 |
| 1456790400 | id2 | No     |       0 |
+------------+-----+--------+---------+

I tried to count for status = 'Yes' with the below code:

df[df['status']== 'Yes'].groupby(['time','id','status']).size().reset_index(name='count')

which obviously gives me those rows with status = 'Yes' and discarded the rest. I want the discarded ones with count = 0

Is there any way to get the result?

Thanks in advance!


Solution

  • Use lambda function with apply and for count sum boolena True values proccesses like 1:

    df1 = (df.groupby(['time','id','status'])
             .apply(lambda x: (x['status']== 'Yes').sum())
             .reset_index(name='count'))
    

    Or create new column and aggregate sum:

    df1 = (df.assign(A=df['status']=='Yes')
             .groupby(['time','id','status'])['A']
             .sum()
             .astype(int)
             .reset_index(name='count'))
    

    Very similar solution with no new column, but worse readable a bit:

    df1 = ((df['status']=='Yes')
            .groupby([df['time'],df['id'],df['status']])
            .sum()
            .astype(int)
            .reset_index(name='count'))
    
    print (df)
             time   id status  count
    0  1451606400  id1    Yes      2
    1  1456790400  id2     No      0
    2  1456790400  id2    Yes      1