python probability distribution percentile kernel-density

Calculate percentiles if we have probability density function data as x and y


I have data extracted from a PDF (probability density function) graph, stored in a CSV file, where x represents incubation times and y is the density. I would like to calculate percentiles, such as the 95th. I'm a bit confused: should I calculate the percentile using the x values only, i.e., using np.percentile(x, 0.95)?

[Image: plot of the data]


Solution

  • Here is some code that uses np.trapz (as proposed by @pjs). We take the x and y arrays and assume they describe a PDF, so we first normalize the area under the curve to 1, and then search backward from the right end until the integral drops to 0.95. I've made up a multi-peak function for the example:

    import numpy as np
    import matplotlib.pyplot as plt
    
    N = 1000
    
    x = np.linspace(0.0, 6.0*np.pi, N)
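    # construct some multi-peak function; at x = 0 this evaluates 0/0,
    # so NumPy emits a RuntimeWarning and the first sample is nan
    y = np.sin(x/2.0)/x
    y[0] = y[1]    # patch the nan at x = 0 with the neighbouring value
    y = np.abs(y)  # a density must be non-negative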
    
    plt.plot(x, y, 'r.')
    plt.show()
    
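    # normalization: scale y so that the area under the curve is 1
    # (np.trapz is available as np.trapezoid in NumPy 2.0 and later)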
    norm = np.trapz(y, x)
    print(norm)
    
    y = y/norm
    print(np.trapz(y, x)) # after normalization
    
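    # now compute the integral, cutting the right limit down by one point
    # with each iteration; stop as soon as the integral drops to 0.95 or below
    for k in range(0, N):
        if k == 0:
            # x[0:-0] would be empty, so use the full arrays on the first pass
            xx = x
            yy = y
        else:
            xx = x[0:-k]
            yy = y[0:-k]
        v = np.trapz(yy, xx)
        print(f"Integral {k} from {xx[0]} to {xx[-1]} is equal to {v}")
        if v <= 0.95:
            break

    # xx[-1] is the 95th percentile, accurate to within one grid step
    print(f"95th percentile is approximately {xx[-1]}")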
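
The loop above re-integrates the whole curve on every pass. A possible alternative, sketched here under the assumption that x is sorted in ascending order and y is non-negative, is to build the cumulative integral once and invert it by linear interpolation. The helper name percentile_from_pdf is made up for this illustration and is not part of NumPy. Note that np.percentile(x, ...) on its own would treat the grid points as samples and ignore the density values entirely, which is why the y values have to enter through the integral.

```python
import numpy as np


def percentile_from_pdf(x, y, q):
    """Return the x value below which a fraction q of the area under y(x) lies.

    Hypothetical helper for illustration; assumes x is sorted ascending
    and y is non-negative.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Area of each trapezoid between neighbouring grid points
    areas = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    # Running sum of the areas gives the unnormalized CDF, starting at 0
    cdf = np.concatenate(([0.0], np.cumsum(areas)))
    cdf /= cdf[-1]  # normalize so the CDF ends at exactly 1
    # Invert the CDF by linear interpolation: look up x at height q
    return float(np.interp(q, cdf, x))


# Same made-up multi-peak function as in the answer above
x = np.linspace(0.0, 6.0 * np.pi, 1000)
with np.errstate(invalid="ignore"):
    y = np.abs(np.sin(x / 2.0) / x)
y[0] = y[1]  # patch the 0/0 at x = 0
print(percentile_from_pdf(x, y, 0.95))
```

Because the result is interpolated between grid points, it should land within one grid step of the value the backward search stops at. Here q is a fraction between 0 and 1, so the 95th percentile is q = 0.95.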