python scipy distribution cdf

Fitting a theoretical distribution to a sampled empirical CDF with scipy stats


I have a plot of the CDF of packet losses. I do not have the original data or the underlying CDF model, only samples read off the CDF curve. (The data is extracted from plots published in the literature.)

I want to find which distribution, and with what parameters, offers the closest fit to the CDF samples.

I've seen that the scipy.stats distributions offer a fit(data) method, but all the examples apply it to raw data points; the PDF/CDF is then drawn from the fitted parameters. Using fit with my CDF samples does not give sensible results.

Am I right in assuming that fit() cannot be directly applied to data samples from an empirical CDF?

What alternatives could I use to find a matching known distribution?


Solution

  • I'm not sure exactly what you're trying to do. When you say you have a CDF, what does that mean? Do you have some data points, or the function itself? It would be helpful if you could post more information or some sample data.

    If you have some data points and know the distribution, it's not hard to do using scipy. If you don't know the distribution, you could iterate over candidate distributions until you find one that fits reasonably well.

    We can define functions of the form required by scipy.optimize.curve_fit, i.e., the first argument should be x, and the remaining arguments are the parameters to fit.

    I use the function f below to generate some test data: the CDF of a normal random variable with a bit of added noise.

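    import numpy as np
    import scipy.optimize
    import scipy.stats

    n = 100
    x = np.linspace(-4, 4, n)
    # Model CDF: x first, then the parameters to fit (the form curve_fit expects)
    f = lambda x, mu, sigma: scipy.stats.norm(mu, sigma).cdf(x)
    # Test data: the true CDF (mu=0.2, sigma=1) plus Gaussian noise
    data = f(x, 0.2, 1) + 0.05 * np.random.randn(n)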
    

    Now, use curve_fit to find parameters.

    mu,sigma = scipy.optimize.curve_fit(f,x,data)[0]
    

    This gives the following output (the exact values vary from run to run because the noise is random):

    >> mu,sigma
    0.1828320963531838, 0.9452044983927278
    

    We can plot the original CDF (orange), the noisy data, and the fitted CDF (blue) and observe that the fit works pretty well.

    [Figure: true CDF, noisy data, recovered CDF]

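    Note that curve_fit accepts additional arguments (for example p0 for an initial guess and bounds for parameter limits), and that its second return value is the estimated covariance matrix of the fitted parameters, which gives an indication of how well the parameters are determined.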
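
If you don't know the distribution, the "iterate over distributions" idea might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the CDF samples are synthesized from a gamma distribution rather than read off a real plot, the four candidates and their starting values are arbitrary picks, and the sum of squared errors between the fitted CDF and the samples is just one possible way to rank the fits.

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical CDF samples: (x, F(x)) pairs as they might be read off a plot.
# Here they are synthesized from a gamma CDF, purely for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.1, 10.0, 40)
cdf_samples = stats.gamma.cdf(x, 2.0, scale=1.5) + 0.005 * rng.standard_normal(x.size)

# Candidate models (an arbitrary selection):
# name -> (CDF with x first and then the parameters, initial guess, bounds)
positive2 = ([1e-6, 1e-6], [np.inf, np.inf])
candidates = {
    "expon": (lambda x, scale: stats.expon.cdf(x, scale=scale),
              [1.0], ([1e-6], [np.inf])),
    "gamma": (lambda x, a, scale: stats.gamma.cdf(x, a, scale=scale),
              [1.0, 1.0], positive2),
    "lognorm": (lambda x, s, scale: stats.lognorm.cdf(x, s, scale=scale),
                [1.0, 1.0], positive2),
    "norm": (lambda x, mu, sigma: stats.norm.cdf(x, loc=mu, scale=sigma),
             [np.median(x), 1.0], ([-np.inf, 1e-6], [np.inf, np.inf])),
}

results = {}
for name, (cdf, p0, bounds) in candidates.items():
    # Least-squares fit of the model CDF to the sampled CDF values
    popt, pcov = optimize.curve_fit(cdf, x, cdf_samples, p0=p0, bounds=bounds)
    # Sum of squared errors between the fitted CDF and the samples
    sse = float(np.sum((cdf(x, *popt) - cdf_samples) ** 2))
    # Approximate one-sigma parameter uncertainties from the covariance matrix
    perr = np.sqrt(np.diag(pcov))
    results[name] = (sse, popt, perr)

# Rank the candidates by SSE (smaller means closer to the samples)
best = min(results, key=lambda k: results[k][0])
for name, (sse, popt, perr) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name:8s} SSE={sse:.5f} params={np.round(popt, 3)} +/- {np.round(perr, 3)}")
print("smallest SSE:", best)
```

A small SSE only says that the curve passes close to the sampled points; candidates with more parameters tend to fit better, so it's worth also looking at the fitted curves against the original plot before settling on one.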