python pandas probability

pandas: calculate probability group by


I can't make sense of the output I get when calculating a probability per group. For example, in the data frame below I want the probability of a2 within each group of a1:

import pandas as pd 
df = pd.DataFrame([[1,1,0],[0,1,1],[0,1,1],[1,1,0],[1,1,0],[1,0,0]],
                  columns=['a1','a2','a3'])

df[["a1","a2"]].groupby('a1').apply(lambda x: x[x>0].count()/len(x)) 

I get output as:

     a1    a2
a1
0   0.0  1.00
1   1.0  0.75

I expected the probability column to add up to 1, and I don't understand why the values in column a2 total 1.75. Second, how do I format Python output as a table, as Stack Overflow expects?

The following answer uses the mean: https://stackoverflow.com/a/43015011/2740831 However, if I understand correctly, a probability should be based on the count of event occurrences.


Solution

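  • The value in your output is 0.75, not 1.75. Each row is the share of a2 > 0 within one a1 group, so the rows are not expected to sum to 1 across groups. The solution can be simplified by taking the mean of a boolean DataFrame: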

    df1 = df["a2"].gt(0).groupby(df['a1']).mean().reset_index(name='prob')
    print (df1)
       a1  prob
    0   0  1.00
    1   1  0.75
    
    
    df2 = df[["a1","a2"]].gt(0).groupby(df['a1']).mean()
    print (df2)
         a1    a2
    a1           
    0   0.0  1.00
    1   1.0  0.75
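
To address the two doubts in the question, here is a minimal sketch, assuming the same `df` as above. It uses `pd.crosstab` with `normalize='index'` (not part of the original answer) to show the full conditional distribution of a2 within each a1 group, where each row does sum to 1. It also checks that the mean of the boolean mask matches the count-based ratio. The names `cond`, `by_mean` and `by_count` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1,1,0],[0,1,1],[0,1,1],[1,1,0],[1,1,0],[1,0,0]],
                  columns=['a1','a2','a3'])

# Conditional distribution of a2 given a1: every row is divided by its own total,
# so each a1 group gets one probability per distinct a2 value.
cond = pd.crosstab(df['a1'], df['a2'], normalize='index')
print(cond)

# The probabilities sum to 1 within each a1 group (across a2 values),
# not down the a2 column across different a1 groups.
print(cond.sum(axis=1))

# The mean of a boolean mask is the same as count(a2 > 0) / group size.
by_mean = df['a2'].gt(0).groupby(df['a1']).mean()
by_count = df[df['a2'] > 0].groupby('a1').size() / df.groupby('a1').size()
print(by_mean.equals(by_count))
```

For a1 = 1 the row holds 0.25 for a2 = 0 and 0.75 for a2 = 1, which is the 0.75 from the output above together with its complement.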