Search code examples
pandasdataframegroup-byaggregate

Filter dataframe based on condition before groupby


Suppose I have a dataframe like this

enter image description here

Create sample dataframe:

import pandas as pd
import numpy as np

data = {
    'gender': np.random.choice(['m', 'f'], size=100),
    'vaccinated': np.random.choice([0, 1], size=100),
    'got sick': np.random.choice([0, 1], size=100)
}

df = pd.DataFrame(data)

and I want to see, by gender, what proportion of vaccinated people got sick.

I've tries something like this:

df.groupby('gender').agg(lambda group: sum(group['vaccinated']==1 & group['sick']==1)
                                          /sum(group['sick']==1))

but this doesn't work because agg works on the series level. Same applies for transform. apply doesn't work either, but I'm not as clear why or how apply functions on groupby objects.

Any ideas how to accomplish this with a single line of code?


Solution

  • You could first filter for the vaccinated people and then group by gender and calculate the proportion of people that got sick..

    df[df.vaccinated == 1].groupby("gender").agg({"got sick":"mean"})
    

    Output:

            got sick
    gender          
    f       0.548387
    m       0.535714
    

    In this case the proportion is calculated based on a sample data that I've created