Let's take the following df as an example:
import pandas as pd

df = pd.DataFrame({
    'start_point': ['Station_1', 'Station_2', 'Station_1', 'Station_3',
                    'Station_1', 'Station_1', 'Station_1', 'Station_3'],
    'end_point': ['Station_1', 'Station_2', 'Station_1', 'Station_2',
                  'Station_2', 'Station_3', 'Station_3', 'Station_1'],
    'period_of_day': ['morning', 'noon', 'morning', 'night',
                      'evening', 'night', 'morning', 'afternoon'],
    'day_of_week': ['0', '1', '2', '0', '1', '0', '1', '0'],
})
I would like to calculate the conditional probability of a trip ending at a given end_point.
I can do this when I use only one condition. As an example, let's calculate the probability of a trip ending at a given end_point, taking into account the day_of_week:
df.groupby('end_point')['day_of_week'].value_counts() / df.groupby('day_of_week')['end_point'].count()
end_point  day_of_week
Station_1  0              0.500000
           2              1.000000
Station_2  1              0.666667
           0              0.250000
Station_3  0              0.250000
           1              0.333333
However, I'm having a hard time calculating this probability when two or more conditions are involved. How can I add period_of_day as a condition as well, for example?
To compute P(A|B), group by B and take the normalized value counts of A:
df.groupby([B])[A].value_counts(normalize=True)
For example:
df.groupby('day_of_week')['end_point'].value_counts(normalize=True)
0  Station_1    0.500000
   Station_2    0.250000
   Station_3    0.250000
1  Station_2    0.666667
   Station_3    0.333333
2  Station_1    1.000000
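To see why this gives P(end_point | day_of_week), here is a minimal sketch that builds the same numbers by hand: the count of each (day_of_week, end_point) pair divided by the number of trips on that day. The names `joint`, `marginal`, `manual` and `shortcut` are only illustrative and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'start_point': ['Station_1', 'Station_2', 'Station_1', 'Station_3',
                    'Station_1', 'Station_1', 'Station_1', 'Station_3'],
    'end_point': ['Station_1', 'Station_2', 'Station_1', 'Station_2',
                  'Station_2', 'Station_3', 'Station_3', 'Station_1'],
    'period_of_day': ['morning', 'noon', 'morning', 'night',
                      'evening', 'night', 'morning', 'afternoon'],
    'day_of_week': ['0', '1', '2', '0', '1', '0', '1', '0'],
})

# Number of trips for each (day_of_week, end_point) pair: count(B and A)
joint = df.groupby(['day_of_week', 'end_point']).size()

# Number of trips on each day_of_week: count(B), the conditioning event
marginal = df.groupby('day_of_week').size()

# P(A | B) = count(B and A) / count(B), broadcast over the day_of_week level
manual = joint.div(marginal, level='day_of_week')

# The one-liner from the answer; normalize=True divides within each group
shortcut = df.groupby('day_of_week')['end_point'].value_counts(normalize=True)

print(manual)
```

Both `manual` and `shortcut` should hold the same probabilities; only the row order within each day may differ, because `value_counts` sorts by frequency.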
And for more than one column:
df.groupby(['day_of_week', 'period_of_day'])['end_point'].value_counts(normalize=True)
day_of_week  period_of_day  end_point
0            afternoon      Station_1    1.0
             morning        Station_1    1.0
             night          Station_2    0.5
                            Station_3    0.5
1            evening        Station_2    1.0
             morning        Station_3    1.0
             noon           Station_2    1.0
2            morning        Station_1    1.0
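As a follow-up, here is a short sketch of how the multi-condition result might be used: checking that each conditional distribution sums to 1, looking up a single probability, and building a wide table with `pd.crosstab(..., normalize='index')` as an alternative. The variable names (`cond`, `probs`, `totals`, `p`, `table`) are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'start_point': ['Station_1', 'Station_2', 'Station_1', 'Station_3',
                    'Station_1', 'Station_1', 'Station_1', 'Station_3'],
    'end_point': ['Station_1', 'Station_2', 'Station_1', 'Station_2',
                  'Station_2', 'Station_3', 'Station_3', 'Station_1'],
    'period_of_day': ['morning', 'noon', 'morning', 'night',
                      'evening', 'night', 'morning', 'afternoon'],
    'day_of_week': ['0', '1', '2', '0', '1', '0', '1', '0'],
})

# Conditioning columns: any number of columns can be listed here
cond = ['day_of_week', 'period_of_day']
probs = df.groupby(cond)['end_point'].value_counts(normalize=True)

# Sanity check: within each (day_of_week, period_of_day) group the
# probabilities over end_point should add up to 1
totals = probs.groupby(level=[0, 1]).sum()

# P(end_point = Station_2 | day_of_week = '0', period_of_day = 'night')
p = probs.loc[('0', 'night', 'Station_2')]

# Alternative: one row per condition, one column per end_point;
# combinations that never occur show up as 0.0 instead of being omitted
table = pd.crosstab(
    [df['day_of_week'], df['period_of_day']],
    df['end_point'],
    normalize='index',
)

print(p)
print(table)
```

The crosstab form is convenient when you want every end_point listed for every condition, while the groupby form keeps only the combinations that actually occur in the data.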