Tags: python, statistics, probability, standard-deviation

How to work out the probability using standard deviation in python


I have a list of students. I have their attendance rates for last year and their attendance for this year so far. I am trying to work out the probability that each student's attendance will be below 90% by the end of the year. Below is my data:

Name  %LastYear AttendedSoFarThisYear  SessionsSoFarThisYear  %SoFarThisYear SessionsNeeded  SessionsLeft
Ethan 97%       218                    232                    94%            52              68
Molly 91%       202                    232                    87%            101             68
Henry 95%       226                    232                    97%            44              68

So at the moment I am working it out by doing SessionsNeeded divided by SessionsLeft, then multiplying by 100. So for Ethan that is 76% likely, Molly is 148% likely and Henry is 65% likely to get below 90%.
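
For reference, the calculation described above can be reproduced in a few lines of Python. This is only a sketch: the function name `naive_risk` and the dictionary layout are made up for illustration, and the numbers are copied from the table.

```python
# Hypothetical helper reproducing the SessionsNeeded / SessionsLeft calculation.
def naive_risk(sessions_needed, sessions_left):
    """Return SessionsNeeded as a percentage of SessionsLeft."""
    return sessions_needed / sessions_left * 100


# (SessionsNeeded, SessionsLeft) pairs copied from the table above.
students = {"Ethan": (52, 68), "Molly": (101, 68), "Henry": (44, 68)}

for name, (needed, left) in students.items():
    print(f"{name}: {naive_risk(needed, left):.1f}%")
```

Because the result can exceed 100% (as it does for Molly), this figure behaves more like a ratio than a probability.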

However I don't think this way of working out the probability is very fair because at the start of the year, everyone will have a very high probability percentage as they haven't completed any sessions. But really I want it to take into account their previous year attendance, so that at the start of the year, Molly will have a higher probability than Ethan.

It also needs to take into account their current attendance rates so far. As Henry is actually attending 97% of the time, it is probable he will continue to do so. Molly, on the other hand, is at about 87%, so it is not likely she will catch up.

Does anyone have any ideas of how I could work this out using this data? Preferably in Python, or even in Excel?


Solution

  • You could weight the importance of previous and current year attendance based on the completion of the current year.

    completion = SessionsSoFarThisYear/TotalSessionsInThisYear

    The attendance probability for a student would be equal to:

    P_attendance(student) = LastYearAttendance(student) * (1 - completion) + ThisYearAttendance(student) * completion

    Edit:

    The next step is to take the difference between the number of sessions the student is expected to attend (SessionsLeft scaled by the student's P_attendance) and SessionsNeeded:

    delta_days(student) = (SessionsLeft * P_attendance(student)) - SessionsNeeded(student)

    Finally, to get the probability of a student fulfilling the attendance requirement of 90%, use a clamped logistic function with L = 1 and x0 = 0:

    log_f(x) = L / (1 + e^(-k * (x - x0)))

    P_passing(student) = 0% if delta_days(student) <= 0, else log_f(delta_days(student))

    For all non-positive delta_days it returns a 0% probability of fulfillment, meaning that, given the current estimate of the student's attendance, there is no chance of passing. The higher the estimated attendance and the more sessions the student can still afford to miss, the higher the probability of passing this year.

    Edit2: The 90% threshold is hidden in SessionsNeeded used to calculate delta_days.
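
The steps above could be put together in Python roughly as follows. This is a sketch under a few assumptions that the answer does not state: TotalSessionsInThisYear is taken to be SessionsSoFarThisYear + SessionsLeft, this year's attendance rate is computed as AttendedSoFarThisYear / SessionsSoFarThisYear, and the logistic steepness `k` defaults to 1. The function names are made up for illustration.

```python
import math


def blended_attendance(last_year_rate, attended_so_far, sessions_so_far, sessions_left):
    """Weight last year's and this year's attendance by how much of the year is done."""
    # Assumption: the year's total is sessions so far plus sessions left.
    total_sessions = sessions_so_far + sessions_left
    completion = sessions_so_far / total_sessions
    # At the very start of the year there is no current rate yet; its weight is 0 anyway.
    this_year_rate = attended_so_far / sessions_so_far if sessions_so_far else 0.0
    return last_year_rate * (1 - completion) + this_year_rate * completion


def delta_days(p_attendance, sessions_needed, sessions_left):
    """Expected sessions attended from now on, minus the sessions still needed."""
    return sessions_left * p_attendance - sessions_needed


def p_passing(delta, k=1.0):
    """Clamped logistic with L = 1 and x0 = 0; k is an assumed steepness."""
    if delta <= 0:
        return 0.0
    return 1.0 / (1.0 + math.exp(-k * delta))


# Rows copied from the question's table:
# (last year's rate, attended so far, sessions so far, sessions needed, sessions left)
students = {
    "Ethan": (0.97, 218, 232, 52, 68),
    "Molly": (0.91, 202, 232, 101, 68),
    "Henry": (0.95, 226, 232, 44, 68),
}

for name, (last_year, attended, so_far, needed, left) in students.items():
    p_att = blended_attendance(last_year, attended, so_far, left)
    delta = delta_days(p_att, needed, left)
    print(f"{name}: P_attendance={p_att:.3f}, delta_days={delta:.1f}, "
          f"P_passing={p_passing(delta, k=1.0):.3f}")
```

Two properties of this formula are worth keeping in mind. With x0 = 0, the logistic function equals 0.5 at zero, so the clamped result jumps from 0% to just over 50% as soon as delta_days becomes positive. With k = 1 the curve also saturates after only a few sessions of margin, so a smaller k (for example 0.1) may give a more gradual spread; the right value is a modelling choice rather than something the data dictates.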