How to Algorithmically Generate a List of Probabilities


Please forgive my lack of statistical nomenclature.

I've been given an arbitrary list of values to sample, currently:
list_to_sample = [1, 2, 3, 4, 5].
At this point, it doesn't matter what the list contains, only that its length is 5.

And, I've been given a list of almost arbitrary "pareto-like" probabilities, currently:
probability_list = [0.5, 0.3, 0.1, 0.05, 0.05]
(Pareto-like in that it does not follow the 80/20 rule but rather 80/40: the top 40% of the list accounts for 80% of the selection probability.)
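
The 80/40 figure can be checked with a cumulative sum. A minimal sketch using the list above (the names `cumulative` and `top_share` are illustrative only):

```python
import numpy as np

probability_list = [0.5, 0.3, 0.1, 0.05, 0.05]

# Running total of probability mass, from most to least likely item
cumulative = np.cumsum(probability_list)

# Fraction of the list covered by the first k items, k = 1..5
top_share = np.arange(1, len(probability_list) + 1) / len(probability_list)

for share, mass in zip(top_share, cumulative):
    print(f"top {share:.0%} of the list -> {mass:.0%} of the probability")
```

The second row of the printout is the one that corresponds to the 80/40 description.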

I am now trying to generalize this, so that if list_to_sample gets longer, like:
[1, 2, 3, 4, 5, 6, 7, 8]
I can extend the probability_list and maintain the same curve.

I am trying to use scipy.stats.pareto.pdf to produce a list of probabilities that is similar to:
[0.5, 0.3, 0.1, 0.05, 0.05]
and where the sum of the list (the sum of the probabilities) equals 1.

Specifically, I have tried this:

import numpy as np
from scipy.stats import pareto

list_to_sample = [1, 2, 3, 4, 5]
output = np.array([pareto.pdf(x=list_to_sample, b=1, loc=0, scale=1)])

Output:

[[0.5        0.125      0.05555556 0.03125    0.02      ]]

I have tried changing the parameters, hoping that pareto could be made to produce the desired result, but so far no luck.

Perhaps there is a better function to produce (or extend) a list of probabilities.


Solution

  • Do you need to use the Pareto distribution? If so, I don't think this problem is well-defined, as the items in list_to_sample will matter, and I don't see from your question how you can define all the parameters of the Pareto distribution.

    If you can use other techniques, I would go with a simple interpolation, for example a cubic spline. Since you said the values in the list don't matter, we can work with the percentage values instead.

    import numpy as np
    from scipy.interpolate import CubicSpline
    
    list_to_sample = [1, 2, 3, 4, 5]
    probability_list = [0.5, 0.3, 0.1, 0.05, 0.05]
    
    # --- adding zero at the beginning to ensure that we map zero to zero
    
    x = np.array([0] + list_to_sample) / len(list_to_sample)
    y = np.array([0] + probability_list).cumsum()
    
    print("x:", x)  # -- [0.0  0.2  0.40  0.60  0.80  1.0]
    print("y:", y)  # -- [0.0  0.5  0.80  0.90  0.95  1.0]
    
    # - spline through the (fraction of list, cumulative probability) points
    
    spline = CubicSpline(x, y)
    
    new_values = np.arange(1, 11)
    cprobs = spline(new_values / len(new_values))
    
    print("New values:", new_values)
    print("Cumulative probabilities:", cprobs)
    
    # -- the top 40% still has an overall 80% probability,
    # -- the output below is rounded
    
    # -- [   1    2    3    4    5    6    7    8    9   10]
    # -- [0.27 0.50 0.68 0.80 0.87 0.90 0.93 0.95 0.97 1.00]
    
    
    # - to get the probability for each value we just diff cprobs
    
    probs = np.diff([0] + list(cprobs))
    print("Probabilities:", probs)
    
    # -- [0.272 0.228 0.178 0.122 0.067 0.034 0.026 0.024 0.024 0.026]
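
The steps above can be wrapped in a small helper that returns probabilities for any target length, for instance the 8-element list from the question. This is a sketch under stated assumptions, not a definitive implementation: the function name `extend_probabilities` is hypothetical, and the clip-and-renormalize step is an added safeguard, since a cubic spline is not guaranteed to be monotone and a difference could in principle come out slightly negative.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def extend_probabilities(base_probs, n):
    """Resample the cumulative curve of base_probs onto n items (hypothetical helper)."""
    base_probs = np.asarray(base_probs, dtype=float)
    m = len(base_probs)

    # Knots: fraction of the list covered vs. cumulative probability, anchored at (0, 0)
    x = np.arange(m + 1) / m
    y = np.concatenate(([0.0], base_probs.cumsum()))
    spline = CubicSpline(x, y)

    # Evaluate the cumulative curve at the n new positions, then difference it
    cprobs = spline(np.arange(1, n + 1) / n)
    probs = np.diff(cprobs, prepend=0.0)

    # Assumption: guard against small negative values, then renormalize to sum to 1
    probs = np.clip(probs, 0.0, None)
    return probs / probs.sum()


probs_8 = extend_probabilities([0.5, 0.3, 0.1, 0.05, 0.05], 8)
print(probs_8.round(3), probs_8.sum())

# Sample from the longer list using the extended probabilities
rng = np.random.default_rng(0)
print(rng.choice([1, 2, 3, 4, 5, 6, 7, 8], size=10, p=probs_8))
```

Calling the helper with n=5 reproduces the original list, because the spline passes through its own knots.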