Search code examples
pythonpandasgroup-bysumcounter

How to calculate percentage of each class variable after grouping it by another variable?


I have a data that initially looks like this:

column1 Time
yes 271-273
no 271-273
neutral 271-273
no 274-276
... ...

I want the percentage of yes, percentage of no, and percentage of neutral across each Time range. I was able to get the count of each one of these categories (yes,no,neutral) across every time range using the following code:

df['COUNTER'] =1    
group_data = df.groupby(['Time','column1'])['COUNTER'].sum() 

I am not sure how to calculate the percentage of counts from this.


Solution

  • Use SeriesGroupBy.value_counts with parameter normalize=True:

    print (df)
       column1     Time
    0      yes  271-273
    1       no  271-273
    2  neutral  271-273
    3       no  271-273 <- changed data for better sample
    
    group_data = (df.groupby(['Time'])['column1']
                    .value_counts(normalize=True)
                    .reset_index(name='%') )
    
    print (group_data)
          Time  column1     %
    0  271-273       no  0.50
    1  271-273  neutral  0.25
    2  271-273      yes  0.25
    

    Another idea with divide by sum of counts with DataFrameGroupBy.size and Series.div:

    s = df.groupby(['Time','column1']).size()
    group_data = s.div(s.groupby(level=0).sum()).reset_index(name='%') 
    print (group_data)
    
          Time  column1     %
    0  271-273  neutral  0.25
    1  271-273       no  0.50
    2  271-273      yes  0.25