Search code examples
pythonnumpyscipyinterpolationcdf

Trying to interpolate the output of a histogram function in Python


What I am trying to do is to play around with some random distribution. I don't want it to be normal. But for the time being normal is easier.

import matplotlib.pyplot as plt
from scipy.stats import norm

ws=norm.rvs(4.0, 1.5, size=100)
density, bins = np.histogram(ws, 50,normed=True, density=True)
unity_density = density / density.sum()

fig, ((ax1, ax2)) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(12,6))
widths = bins[:-1] - bins[1:]

ax1.bar(bins[1:], unity_density, width=widths)
ax2.bar(bins[1:], unity_density.cumsum(), width=widths)

fig.tight_layout()

enter image description here Then what I can do it visualize CDF in terms of points.

density1=unity_density.cumsum()
x=bins[:-1]
y=density1

plt.plot(x, density1, 'o')

enter image description here

So what I have been trying to do is to use the np.interp function on the output of np.histogram in order to obtain a smooth curve representing the CDF and extracting the percent points to plot them. Ideally, I need to try to do it all both manually and using ppf function from scipy. I have always struggled with statistics as an undergraduate. I am in grad school now and try to put me through as many exercises like this as possible in order to get a deeper understanding of what is happening. I've reached a point of desperation with this task. Thank you!


Solution

  • One possibility to get smoother results is to use more samples, by using 10^5 samples and 100 bins I get the following images:

    ws = norm.rvs(loc=4.0, scale=1.5, size=100000)
    density, bins = np.histogram(ws, bins=100, normed=True, density=True)
    

    histogram histogram

    In general you could use scipys interpolation module to smooth your CDF. For 100 samples and a smoothing factor of s=0.01 I get:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.interpolate import splev, splrep
    
    density1 = unity_density.cumsum()
    x = bins[:-1]
    y = density1
    
    # Interpolation
    spl = splrep(x, y, s=0.01, per=False)
    x2 = np.linspace(x[0], x[-1], 200)
    y2 = splev(x2, spl)
    
    # Plotting
    fig, ax = plt.subplots()
    plt.plot(x, density1, 'o')
    plt.plot(x2, y2, 'r-')
    

    CDF of Norm(loc=4, scale=1.5) interpolated

    The third possibility is to calculate the CDF analytically. If you generate the noise yourself with a numpy / scipy function most of the time there is already an implementation of the CDF available, otherwise you should find it on Wikipedia. If your samples come from measurements that is of course a different story.

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt
    
    fig, ax = plt.subplots()
    x = np.linspace(-2, 10)
    y = norm(loc=4.0, scale=1.5).cdf(x)
    ax.plot(x, y, 'bo-')
    

    CDF of Norm(loc=4, scale=1.5) analytically