
Generate probability distribution or smoothing plot from points containing probabilities


I have points with values on the x-axis and probabilities on the y-axis, like:

p1 =
[[0.0, 0.0001430560406790707],
[10.0, 6.2797052001508247e-13],
[15.0, 4.8114669550502021e-06],
[20.0, 0.0007443231772534647],
[25.0, 0.00061070912573869406],
[30.0, 0.48116582167944905],
[35.0, 0.24698643991977953],
[40.0, 0.016407283121225951],
[45.0, 0.2557158314329116],
[50.0, 1.1252231121357235e-05],
[55.0, 0.064666668633158647],
[60.0, 1.7631447655837744e-17],
[65.0, 1.1294722466816786e-14],
[70.0, 2.9419020411134367e-16],
[75.0, 3.0887653014525822e-17],
[80.0, 4.4973693062706866e-17],
[85.0, 9.0975358174005147e-15],
[90.0, 1.0758266454985257e-10],
[95.0, 7.2923752473657924e-08],
[100.0, 1.8065366882584036e-08]]

p2 =
[[0.0, 4.1652247577331996e-06],
[10.0, 1.2212829713673957e-06],
[15.0, 6.5906857192417344e-08],
[20.0, 0.00016745946587138236],
[25.0, 0.0054431111796765554],
[30.0, 0.0067575214586160616],
[35.0, 0.00011856110316632124],
[40.0, 0.00032181662132509944],
[45.0, 0.001397981055516994],
[50.0, 0.0027058954834684062],
[55.0, 2.553142406703067e-06],
[60.0, 1.1514033594755017e-08],
[65.0, 0.21961568282994792],
[70.0, 2.4658349829099807e-08],
[75.0, 0.0022850986575076743],
[80.0, 3.5603047823624507e-06],
[85.0, 0.99406392082894734],
[90.0, 0.24399923235645221],
[95.0, 0.0013470125217945798],
[100.0, 0.042582366972883985]] 

Now I want to generate a probability distribution from these points, where the x-axis holds the values (0, 10, 15, 20, ..., 100) and the y-axis holds the probabilities (0.00014, ...).

When using the plt.plot function, I get:

plt.plot([item[0] for item in p1],[item[1] for item in p1])

[Image: line plot of p1]

And for p2:

plt.plot([item[0] for item in p2],[item[1] for item in p2])

[Image: line plot of p2]

I want to get a smoother visualization, like a probability distribution:

[Image: example of a smooth probability distribution curve]

If a probability distribution is not possible, then a smoothing spline:

[Image: example of a smoothing spline]


Solution

  • SciPy's gaussian_kde is often used to smoothly approximate a probability distribution. It places a Gaussian kernel on each input point and sums them. Usually the inputs are individual measurements, but the weights parameter also allows working with binned data. The resulting function is normalized so that its integral equals one.

    This approach assumes that the values in p1 and p2 are meant as the mean over the segment ending at each x-value, similar to a histogram, i.e. a step function in which each x-value marks the end of a step.

    from matplotlib import pyplot as plt
    import numpy as np
    from scipy.stats import gaussian_kde
    
    p1 = np.array([[0.0, 0.0001430560406790707],
                   [10.0, 6.2797052001508247e-13],
                   [15.0, 4.8114669550502021e-06],
                   [20.0, 0.0007443231772534647],
                   [25.0, 0.00061070912573869406],
                   [30.0, 0.48116582167944905],
                   [35.0, 0.24698643991977953],
                   [40.0, 0.016407283121225951],
                   [45.0, 0.2557158314329116],
                   [50.0, 1.1252231121357235e-05],
                   [55.0, 0.064666668633158647],
                   [60.0, 1.7631447655837744e-17],
                   [65.0, 1.1294722466816786e-14],
                   [70.0, 2.9419020411134367e-16],
                   [75.0, 3.0887653014525822e-17],
                   [80.0, 4.4973693062706866e-17],
                   [85.0, 9.0975358174005147e-15],
                   [90.0, 1.0758266454985257e-10],
                   [95.0, 7.2923752473657924e-08],
                   [100.0, 1.8065366882584036e-08]])
    p2 = np.array([[0.0, 4.1652247577331996e-06],
                   [10.0, 1.2212829713673957e-06],
                   [15.0, 6.5906857192417344e-08],
                   [20.0, 0.00016745946587138236],
                   [25.0, 0.0054431111796765554],
                   [30.0, 0.0067575214586160616],
                   [35.0, 0.00011856110316632124],
                   [40.0, 0.00032181662132509944],
                   [45.0, 0.001397981055516994],
                   [50.0, 0.0027058954834684062],
                   [55.0, 2.553142406703067e-06],
                   [60.0, 1.1514033594755017e-08],
                   [65.0, 0.21961568282994792],
                   [70.0, 2.4658349829099807e-08],
                   [75.0, 0.0022850986575076743],
                   [80.0, 3.5603047823624507e-06],
                   [85.0, 0.99406392082894734],
                   [90.0, 0.24399923235645221],
                   [95.0, 0.0013470125217945798],
                   [100.0, 0.042582366972883985]])
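    x = np.linspace(0, 100, 1000)
    fig, axes = plt.subplots(ncols=2)
    for ax, p in zip(axes, [p1, p2]):
        # Treat each x-value as the right edge of a 5-wide segment; the first
        # point (x=0) is moved to 5 so that its segment runs from 0 to 5.
        p[0, 0] = 5.0
        # Dotted step plot of the raw probabilities (left y-axis).
        ax.step(p[:, 0], p[:, 1], color='dodgerblue', lw=1, ls=':', where='pre')
        # The KDE is on a different scale, so draw it on a secondary y-axis.
        ax2 = ax.twinx()
        # One Gaussian kernel per segment center (right edge minus 2.5),
        # weighted by that segment's probability.
        kde = gaussian_kde(p[:, 0] - 2.5, bw_method=.25, weights=p[:, 1])
        ax2.plot(x, kde(x), color='crimson')
    plt.show()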
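
To make the two properties of gaussian_kde described above more concrete (a weighted sum of Gaussian kernels, normalized to unit area), here is a minimal sketch. The toy arrays centers and probs are hypothetical and are not the p1/p2 data. The sketch reads the kernel width from the fitted estimator's covariance attribute and assumes the one-dimensional case.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical toy bins (not the p1/p2 data): bin centers and bin probabilities.
centers = np.array([10.0, 20.0, 30.0, 40.0])
probs = np.array([0.1, 0.6, 0.2, 0.1])

kde = gaussian_kde(centers, bw_method=0.25, weights=probs)

# Kernel standard deviation used by the fitted estimator (1-D case).
sigma = np.sqrt(kde.covariance[0, 0])

# Rebuild the estimate by hand: one Gaussian per bin center, scaled by its
# normalized weight, then summed over all centers.
w = probs / probs.sum()
grid = np.linspace(-20.0, 70.0, 451)
z = (grid[None, :] - centers[:, None]) / sigma
manual = (w[:, None] * np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))).sum(axis=0)

# Compare the hand-built curve with the library's evaluation.
same_curve = np.allclose(manual, kde(grid))

# Area under the curve; the bounds are wide enough to cover both tails.
area = kde.integrate_box_1d(-1000.0, 1000.0)
print(same_curve, round(float(area), 6))
```

Because the curve is normalized to unit area, its height is a density and not a probability per point, which is one reason to draw it on a secondary y-axis next to the raw values.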
    

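
For the fallback mentioned in the question (a smooth curve through the points when a density estimate is not wanted), one possible sketch uses SciPy's PchipInterpolator. Note that this is a swapped-in technique: PCHIP is a shape-preserving cubic interpolant, not a true smoothing spline. It is chosen here on the assumption that the curve should pass through every point and should not dip below zero between them. The array pts is hypothetical toy data, not p1 or p2.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical toy points as (x, probability) pairs, not the p1/p2 data.
pts = np.array([[0.0, 0.001],
                [10.0, 0.0],
                [15.0, 0.05],
                [20.0, 0.48],
                [25.0, 0.25],
                [30.0, 0.02],
                [35.0, 0.0]])
xs, ys = pts[:, 0], pts[:, 1]

# Piecewise-cubic interpolant that passes through every point and does not
# overshoot between neighbouring points.
curve = PchipInterpolator(xs, ys)

# Evaluate on a dense grid to get a smooth-looking line.
grid = np.linspace(xs.min(), xs.max(), 500)
smooth = curve(grid)
```

The result can be drawn with plt.plot(grid, smooth). Unlike the gaussian_kde curve, this line is not normalized and keeps the original y-scale.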