
Calculating conditional probability with multiple (2 or more) conditions in Python


Let's take the following df as an example:

import pandas as pd

df = pd.DataFrame({
    'start_point': ['Station_1', 'Station_2', 'Station_1', 'Station_3', 'Station_1', 'Station_1', 'Station_1', 'Station_3'],
    'end_point': ['Station_1', 'Station_2', 'Station_1', 'Station_2', 'Station_2', 'Station_3', 'Station_3', 'Station_1'],
    'period_of_day': ['morning', 'noon', 'morning', 'night', 'evening', 'night', 'morning', 'afternoon'],
    'day_of_week': ['0', '1', '2', '0', '1', '0', '1', '0'],
})

I would like to calculate the conditional probability of a trip ending at a given end_point, conditioned on one or more of the other columns.

I can do this when there is only one condition. For example, here is the probability of a trip ending at a given end_point, conditioned on day_of_week:

df.groupby('end_point')['day_of_week'].value_counts() / df.groupby('day_of_week')['end_point'].count()

end_point  day_of_week
Station_1  0              0.500000
           2              1.000000
Station_2  1              0.666667
           0              0.250000
Station_3  0              0.250000
           1              0.333333

However, I'm having a hard time calculating this probability when I involve two or more conditions. How can I add period_of_day also as a condition, for example?


Solution

  • To compute P(A|B), use:

    df.groupby([B])[A].value_counts(normalize=True)
    

    For example:

    df.groupby('day_of_week')['end_point'].value_counts(normalize=True)
    
    day_of_week  end_point
    0            Station_1    0.500000
                 Station_2    0.250000
                 Station_3    0.250000
    1            Station_2    0.666667
                 Station_3    0.333333
    2            Station_1    1.000000
    

    And to condition on more than one column, pass a list of columns to groupby:

    df.groupby(['day_of_week', 'period_of_day'])['end_point'].value_counts(normalize=True)
    
    day_of_week  period_of_day  end_point
    0            afternoon      Station_1    1.0
                 morning        Station_1    1.0
                 night          Station_2    0.5
                                Station_3    0.5
    1            evening        Station_2    1.0
                 morning        Station_3    1.0
                 noon           Station_2    1.0
    2            morning        Station_1    1.0
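
As a sanity check on the groupby approach above, here is a minimal sketch that recomputes the same conditional probabilities directly from the definition, P(A | B, C) = count(A, B, C) / count(B, C), and looks up a single value. The helper name `conditional_prob` and the column names `n` and `p` are assumptions made for illustration; they are not part of pandas.

```python
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({
    'start_point': ['Station_1', 'Station_2', 'Station_1', 'Station_3',
                    'Station_1', 'Station_1', 'Station_1', 'Station_3'],
    'end_point': ['Station_1', 'Station_2', 'Station_1', 'Station_2',
                  'Station_2', 'Station_3', 'Station_3', 'Station_1'],
    'period_of_day': ['morning', 'noon', 'morning', 'night',
                      'evening', 'night', 'morning', 'afternoon'],
    'day_of_week': ['0', '1', '2', '0', '1', '0', '1', '0'],
})

conditions = ['day_of_week', 'period_of_day']

# P(end_point | day_of_week, period_of_day) using the approach from the answer.
cond_prob = df.groupby(conditions)['end_point'].value_counts(normalize=True)


def conditional_prob(data, target, given):
    """Hypothetical helper: P(target | given) as a tidy DataFrame.

    Divides the count of each (given..., target) combination by the
    count of its (given...) group.
    """
    given = list(given)
    counts = data.groupby(given + [target]).size().rename('n').reset_index()
    counts['p'] = counts['n'] / counts.groupby(given)['n'].transform('sum')
    return counts


table = conditional_prob(df, 'end_point', conditions)

# Look up one probability; note that day_of_week holds strings in this data.
p_night = cond_prob.loc[('0', 'night', 'Station_2')]
print(p_night)
print(table)
```

The tidy table can be easier to filter or merge than the MultiIndex Series, while the `value_counts(normalize=True)` form is shorter when the probabilities only need to be displayed.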