python statistics histogram probability

How to get a list of probabilities after creating a histogram for continuous data (Python)?


I have the data set below (Data). I build a histogram with the code below to get n, the number of points in each bin (the frequency). I then divide each frequency by the total number of points to get the probability of each bin (bin_probability).

Now I want the probability for each point, as a list. For example, if point 1 is in bin 1, its probability is the first value in the array, 0.65; if point 2 is in bin 5, its probability is 0.05, and so on. How do I map each point to its bin_probability so that I end up with one probability per point (20 probabilities in this case)?

import numpy as np

Data = [4.33, 4.11, 6.33, 5.67, 3.24, 6.74, 24.6, 6.43, 4.122, 9.67, 9.99, 3.44, 5.66, 3.54, 5.34, 6.55, 5.78, 3.56, 1.55, 5.45]

n, bin_edges = np.histogram(Data, bins=5)
totalcount = np.sum(n)
bin_probability = n / totalcount
print(bin_probability)
>> array([0.65, 0.3 , 0.  , 0.  , 0.05])

Many thanks for your help!


Solution

  • Based on @kcsquared's link above, you can build an array that holds the bin number of each point. The variable 'bins_per_point' has 20 elements, one per data point, and each element is the (1-based) number of the bin that point falls in. np.digitize places the maximum value, which sits exactly on the last bin edge, one bin past the end, so np.fmin clamps it back into the last bin. The variable 'probability_perpoint' then looks up the probability of that bin for each point.

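    # 1-based bin number of each point; fmin clamps the maximum into the last bin
    bins_per_point = np.fmin(np.digitize(Data, bin_edges), len(bin_edges)-1)
    # look up each point's bin probability (subtract 1 for 0-based indexing)
    probability_perpoint = bin_probability[bins_per_point - 1]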
    >> array([0.65, 0.65, 0.3 , 0.65, 0.65, 0.3 , 0.05, 0.3 , 0.65, 0.3 , 0.3 ,
           0.65, 0.65, 0.65, 0.65, 0.3 , 0.65, 0.65, 0.65, 0.65])
    

    To verify, the bin probabilities sum to 1.

     np.sum(bin_probability) 
    >> 1
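
For reference, the whole mapping can be put together as a minimal, self-contained sketch. It assumes bins=5, which matches the five-element bin_probability array printed in the question. The helper name per_point_probability and the variable bin_index are made up for this illustration, and np.clip is used here in place of np.fmin to clamp the bin numbers.

```python
import numpy as np

# Data from the question.
data = np.array([4.33, 4.11, 6.33, 5.67, 3.24, 6.74, 24.6, 6.43, 4.122, 9.67,
                 9.99, 3.44, 5.66, 3.54, 5.34, 6.55, 5.78, 3.56, 1.55, 5.45])


def per_point_probability(values, bins=5):
    """Hypothetical helper: relative frequency of the bin each value falls in."""
    values = np.asarray(values, dtype=float)
    counts, bin_edges = np.histogram(values, bins=bins)
    bin_probability = counts / counts.sum()

    # np.digitize returns 1-based bin numbers. The maximum lies exactly on the
    # last edge and is reported as len(bin_edges), so clip it into the last bin
    # before converting to 0-based indices.
    bin_index = np.clip(np.digitize(values, bin_edges), 1, len(counts)) - 1
    return bin_probability[bin_index], bin_index


probability_per_point, bin_index = per_point_probability(data, bins=5)
print(probability_per_point)
```

Indexing bin_probability with an integer array avoids the explicit Python loop and returns a NumPy array with one probability per data point.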