Search code examples
pythonnumpymatplotlibstatisticsprobability

How to compute and plot the pdf from the empirical cdf?


I have two numpy arrays, one is an array of x values and the other an array of y values and together they give me the empirical cdf. E.g.:

plt.plot(xvalues, yvalues)
plt.show()

I assume the data needs to be smoothed somehow in order to give a smooth pdf.

enter image description here

I would like to plot the pdf. How can I do that?

The raw data is at: http://dpaste.com/1HVK5DR .


Solution

  • There are two main problems: Your data seems to be quite noisy, and it is not equally spaced: The points at the low end are sampled quite densly, while the ponts at the high end are sampled quite sparsely. This can cause numerical issues.

    So first I suggest resampling the data using a linear interpolation to get equaly spaced samples: (Note that all the snippets appended to eachother form the content of one python file.)

    import matplotlib.pyplot as plt
    import numpy as np
    from data import xvalues, yvalues #load data from file
    
    
    print("#datapoints: {}".format(len(xvalues)))
    #don't use every point if your computer is not very fast
    xv = np.array(xvalues)[::5]
    yv = np.array(yvalues)[::5]
    
    #interpolate to have evenly space data
    xi = np.linspace(xv.min(), xv.max(), 400)
    yi = np.interp(xi, xv, yv)
    

    Then, to smoothen the data, I suggest performing a RBF regression (=using an "RBF Network"). The idea is fiting a curve of the form

     c(t) = sum a(i) * phi(t - x(i))    #(not part of the program)
    

    where phi is some radial basis function. (In theory we could use any functions.) To have a very smooth result I choose a very smooth function, namely a gaussian: phi(x) = exp( - x^2/sigma^2) where sigma is yet to be determined. The x(i) are just some nodes that we can define. If we have a smooth function, we just need a few nodes. The number of nodes also determines how much computation needs to be done. The a(i) are the coefficients we can optimize to get the best fit. In this case I just use a least squares approach.

    Note that IF we can write a function in the form above, it is very easy to compute the derivative, it is just

     c(t) = sum a(i) * phi'(t - x(i))
    

    where phi' is the derivative of phi. #(not part of the program)

    Regarding sigma: It is usually a good idea to choose it as a multiple of the step between the nodes we chose. The greater we choose sigma, the smoother the resulting function gets.

    #set up rbf network
    rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes
    print("#rbfs: {}".format(rbf_nodes.shape[1]))
    #estimate width of kernels:
    sigma = 20  #greater = smoother, this is the primary parameter to play with
    sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1]))
    
    
    # kernel & derivative
    rbf = lambda r:1/(1+(r/sigma)**2)
    Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2
    
    #compute coefficients of rbf network
    r = np.abs(xi[:, None]-rbf_nodes)
    A = rbf(r)
    coeffs = np.linalg.lstsq(A, yi, rcond=None)[0]
    print(coeffs)
    
    #evaluate rbf network
    N=1000
    xe = np.linspace(xi.min(), xi.max(), N)
    Ae = rbf(xe[:, None] - rbf_nodes)
    ye = Ae @ coeffs
    
    
    #evaluate derivative
    N=1000
    xd = np.linspace(xi.min(), xi.max(), N)
    Bd = Drbf(xe[:, None] - rbf_nodes)
    yd = Bd @ coeffs
    
    
    fig,ax = plt.subplots()
    ax2 = ax.twinx()
    
    ax.plot(xv, yv, '-')
    ax.plot(xi, yi, '-')
    ax.plot(xe, ye, ':')
    
    ax2.plot(xd, yd, '-')
    
    fig.savefig('graph.png')
    print('done')
    
    

    enter image description here