Search code examples
pythonscipynormal-distribution

Curve fiting of normal distribution in Python


I want to calculate the percentiles of normal distribution data, so I first fit the data to the normal distribution, here is the example:

from scipy.stats import norm
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

x = np.array([ 0.47712125,  0.5445641 ,  0.61193563,  0.67924615,  0.74671202,
    0.81404772,  0.88144172,  0.94885291,  1.01623919,  1.08361011,
    1.15100191,  1.21837793,  1.28578227,  1.3531658 ,  1.42054981,
    1.48794397,  1.55532424,  1.62272161,  1.69010744,  1.75749472,
    1.82488047,  1.89226717,  1.9596566 ,  2.02704774,  2.09443269,
    2.16182302,  2.2292107 ,  2.29659719,  2.36398595,  2.43137342,
    2.49876254,  2.56614983,  2.63353814,  2.700926  ,  2.76831392,
    2.83570198,  2.90308999,  2.97008999,  3.03708997,  3.10408999,
    3.17108999,  3.23808998,  3.30508998,  3.37208999,  3.43908999,
    3.50608998,  3.57308998,  3.64008999,  3.70708999,  3.77408999,
    3.84108999,  3.90808999])
y = array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
     0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
     0.00000000e+00,   5.50000000e+01,   1.33500000e+02,
     2.49000000e+02,   4.40000000e+02,   7.27000000e+02,
     1.09000000e+03,   1.53000000e+03,   2.21500000e+03,
     3.13500000e+03,   4.44000000e+03,   5.57000000e+03,
     6.77000000e+03,   8.04500000e+03,   9.15500000e+03,
     1.00000000e+04,   1.06000000e+04,   1.06500000e+04,
     1.02000000e+04,   9.29000000e+03,   8.01500000e+03,
     6.50000000e+03,   5.24000000e+03,   4.11000000e+03,
     2.97000000e+03,   1.86000000e+03,   1.02000000e+03,
     5.26500000e+02,   2.49000000e+02,   1.11000000e+02,
     5.27000000e+01,   6.90825000e+00,   4.54329000e+00,
     3.63846500e+00,   3.58135000e+00,   2.37404000e+00,
     1.81840000e+00,   1.20159500e+00,   6.02470000e-01,
     3.43295000e-01,   1.62295000e-01,   7.99350000e-02,
     3.60750000e-02,   1.50000000e-02,   3.61500000e-03,
     8.00000000e-05])

def datafit(x,N,u,sig):
    y = N/(np.sqrt(2*np.pi)*sig)*np.exp(-(x-u)**2/2*sig**2)
    return y
popt,popc = curve_fit(datafit,x,y,p0=[np.max(y),2,2])
Normal_distribution = norm(loc = popt[-2],scale = popt[-1])

Then I checked if the plot of (x,y) and (x,popt[0]*Normal_distribution.pdf(x))are same, but the result shows they are totally different.... Blue line is plot of (x,y)

The blue line is plot of (x,y), and the orange line is the plot of (x,popt[0]*Normal_distribution.pdf(x).

Why this happen? Is there anything wrong in my code?


Solution

  • depends on what you plotted, these look OK to me:

    plt.plot(x,y)
    Out[3]: [<matplotlib.lines.Line2D at 0xb9cef98>]
    
    popt,popc
    Out[4]: 
    (array([  8.41765250e+04,   1.98651581e+00,   3.15537860e+00]),
     array([[  5.64670700e+05,   1.12782889e-05,   1.15455042e+01],
            [  1.12782889e-05,   2.91058556e-06,   2.73909077e-10],
            [  1.15455042e+01,   2.73909077e-10,   2.88523818e-04]]))
    
    plt.plot(x,datafit(x,*popt))
    Out[5]: [<matplotlib.lines.Line2D at 0xb990080>]
    

    enter image description here

    my guess is that you've got an error in sig, scale and *,/ in your datafit def vs norm()

    I rewrote datafit to match the scipy norm.pdf

    and still have a factor of ~pi issue which may be just definitional: https://en.wikipedia.org/wiki/Normal_distribution

    oops, looks like the "factor of pi" was just coincidence of your particular data
    rereading the norm.pdf def suggest the whole is rescaled by the 'scale" factor so now I think it should be:

    '''
    norm.pdf(x) = exp(-x**2/2)/sqrt(2*pi)
    norm.pdf(x, loc, scale) == norm.pdf(y) / scale with y = (x - loc) / scale
    '''
    def datafit(x,N,u,sig):
    #    y = N/(np.sqrt(2*np.pi)*sig)*np.exp(-(x-u)**2/2*sig**2)
        y = N*np.exp(-((x-u)/sig)**2/2)/(np.sqrt(2*np.pi))
        return y
    popt,popc = curve_fit(datafit,x,y,p0=[np.max(y),2,2])
    
    # scipy norm.pdf with scaling factors to match datafit()
    Normal_distribution = popt[0]*popt[2]*norm.pdf(x, popt[1], popt[2])
    
    plt.plot(x,y, 'b')
    plt.plot(x, datafit(x+.1, *popt), 'g')
    plt.plot(x, Normal_distribution, 'r')
    

    enter image description here