python, python-3.x, numpy, scikit-learn, scipy

Automate the calculation in Python of a Distribution Function in a system of equations


I would like to approximate the distribution function of my data by setting up a system of equations:

F(t) := P(X <= t) ~ (number of observations <= t) / total_observations =: f(t)

My data:

List_Goals: [1, 2, 2, 1, 2]  
Matches played: 5   

For example, if a football club scored [1, 2, 2, 1, 2] goals in its last 5 matches, it means that:

- 0 goals: scored in 0 matches (none);
- 1 or fewer goals: scored in 2 matches (1, 1);
- 2 or fewer goals: scored in 5 matches (1, 2, 2, 1, 2).

If the goals scored are <= 0, then I have 0/5 = 0;
If the goals scored are <= 1, then I have 2/5 = 0.4;
If the goals scored are <= 2, then I have 5/5 = 1;

f(0) = 0/5 = 0;
f(1) = 2/5 = 0.4;
f(2) = 5/5 = 1;

System: {f(0) = F(0)}, therefore 0/5 = 0
        {f(1) = F(1)}, therefore 2/5 = 0.4
        {f(2) = F(2)}, therefore 5/5 = 1

In this case I have three equations in one unknown, but in general the system must contain one equation for each value from F(0) to F(max_goals_scored), which gives max_goals_scored + 1 equations. The number of equations should therefore grow with the maximum. For example, if List_Goals contained a 6 (List_Goals: [1, 2, 6, 1, 2]), the maximum would be 6 and the functions would be F(0), F(1), F(2), F(3), F(4), F(5), F(6).

How can I automate all of this in Python? Any library is fine.


Solution

  • Use bincount:

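    import numpy as np

    Goals = [1, 2, 2, 1, 2]

    # Relative frequency of each goal count 0..max(Goals)
    f = np.bincount(Goals) / len(Goals)
    print(f)  # [0.  0.4 0.6]

    # The running sum of the frequencies is the empirical distribution function
    F = f.cumsum()
    print(F)  # [0.  0.4 1. ]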
    

    Note that f here is the (empirical) probability mass function, while F is the cumulative distribution function. The two are not the same: the quantity called f(t) in the question corresponds to F in this code.
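
    The same idea can be wrapped in a small function so that the number of values adapts to the maximum automatically, as in the [1, 2, 6, 1, 2] case from the question. This is only a minimal sketch: the name empirical_cdf is hypothetical, and it assumes the goal counts are non-negative integers (np.bincount requires that). The last lines cross-check the result directly against the definition "share of observations <= t".

```python
import numpy as np


def empirical_cdf(goals):
    """Return F(0), ..., F(max(goals)) as a NumPy array (hypothetical helper)."""
    goals = np.asarray(goals)
    # counts[k] = number of matches with exactly k goals, for k = 0..max(goals)
    counts = np.bincount(goals)
    # Cumulative counts divided by the number of matches give P(X <= k)
    return counts.cumsum() / goals.size


goals = [1, 2, 6, 1, 2]
F = empirical_cdf(goals)
for t, value in enumerate(F):
    print(f"F({t}) = {value}")

# Cross-check against the definition: share of observations <= t
direct = np.array([(np.asarray(goals) <= t).mean() for t in range(max(goals) + 1)])
assert np.allclose(F, direct)
```

    Summing the integer counts first and dividing once at the end avoids accumulating floating-point rounding errors in the running sum.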