Search code examples
python-3.xcurve-fittinggaussianleast-squarescdf

Fit CDF with 2 Gaussian using LeastSq


I am trying to fit empirical CDF plot to two Gaussian cdf as it seems that it has two peaks, but it does not work. I fit the curve with leastsq from scipy.optimize and erf function from scipy.special. The fitting only gives constant line at a value of 2. I am not sure in which part of the code that I make mistake. Any pointers will be helpful. Thanks!

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.array([ 90.64115156,  90.85690063,  91.07264971,  91.28839878,
        91.50414786,  91.71989693,  91.93564601,  92.15139508,
        92.36714415,  92.58289323,  92.7986423 ,  93.01439138,
        93.23014045,  93.44588953,  93.6616386 ,  93.87738768,
        94.09313675,  94.30888582,  94.5246349 ,  94.74038397,
        94.95613305,  95.17188212,  95.3876312 ,  95.60338027,
        95.81912935,  96.03487842,  96.2506275 ,  96.46637657,
        96.68212564,  96.89787472,  97.11362379,  97.32937287,
        97.54512194,  97.76087102,  97.97662009,  98.19236917,
        98.40811824,  98.62386731,  98.83961639,  99.05536546,
        99.27111454,  99.48686361,  99.70261269,  99.91836176,
       100.13411084, 100.34985991, 100.56560899, 100.78135806,
       100.99710713, 101.21285621])
y = np.array([3.33333333e-04, 3.33333333e-04, 3.33333333e-04, 1.00000000e-03,
       1.33333333e-03, 3.33333333e-03, 6.66666667e-03, 1.30000000e-02,
       2.36666667e-02, 3.40000000e-02, 5.13333333e-02, 7.36666667e-02,
       1.01666667e-01, 1.38666667e-01, 2.14000000e-01, 3.31000000e-01,
       4.49666667e-01, 5.50000000e-01, 6.09000000e-01, 6.36000000e-01,
       6.47000000e-01, 6.54666667e-01, 6.61000000e-01, 6.67000000e-01,
       6.76333333e-01, 6.84000000e-01, 6.95666667e-01, 7.10000000e-01,
       7.27666667e-01, 7.50666667e-01, 7.75333333e-01, 7.93333333e-01,
       8.11333333e-01, 8.31333333e-01, 8.56333333e-01, 8.81333333e-01,
       9.00666667e-01, 9.22666667e-01, 9.37666667e-01, 9.47333333e-01,
       9.59000000e-01, 9.70333333e-01, 9.77333333e-01, 9.83333333e-01,
       9.90333333e-01, 9.93666667e-01, 9.96333333e-01, 9.99000000e-01,
       9.99666667e-01, 1.00000000e+00])
plt.plot(a,b,'r.')

# Fitting with 2 Gaussian
from scipy.special import erf
from scipy.optimize import leastsq

def two_gaussian_cdf(params, x):
    (mu1, sigma1, mu2, sigma2) = params
    model = 0.5*(1 + erf( (x-mu1)/(sigma1*np.sqrt(2)) )) +\
            0.5*(1 + erf( (x-mu2)/(sigma2*np.sqrt(2)) ))
    return model

def residual_two_gaussian_cdf(params, x, y):
    model = double_gaussian(params, x)
    return model - y

params = [5.,2.,1.,2.]
out = leastsq(residual_two_gaussian_cdf,params,args=(x,y))
double_gaussian(out[0],x)
plt.plot(x,two_gaussian_cdf(out[0],x))

which return to this plot

enter image description here


Solution

  • You may find lmfit (see http://lmfit.github.io/lmfit-py/) to be a useful alternative to leastsq here as it provides a higher-level interface to optimization and curve fitting (though still based on scipy.optimize.leastsq). With lmfit, your example might look like this (cutting out the definition of x and y data):

    #!/usr/bin/env python  
    import numpy as np
    from scipy.special import erf
    import matplotlib.pyplot as plt
    from lmfit import Model
    
    # define the basic model.  I included an amplitude parameter
    def gaussian_cdf(x, amp, mu, sigma):
        return (amp/2.0)*(1 + erf( (x-mu)/(sigma*np.sqrt(2))))
    
    # create a model that is the sum of two gaussian_cdfs
    # note that a prefix names each component and will be
    # applied to the parameter names for each model component
    model = Model(gaussian_cdf, prefix='g1_') + Model(gaussian_cdf, prefix='g2_')
    
    # make a parameters object -- a dict with parameter names
    # taken from the arguments of your model function and prefix
    params = model.make_params(g1_amp=0.50, g1_mu=94, g1_sigma=1,
                               g2_amp=0.50, g2_mu=98, g2_sigma=1.)
    
    # you can apply bounds to any parameter
    #params['g1_sigma'].min = 0   # sigma must be > 0!
    
    # you may want to fix the amplitudes to 0.5:
    #params['g1_amp'].vary = False
    #params['g2_amp'].vary = False
    
    # run the fit
    result = model.fit(y, params, x=x)
    
    # print results
    print(result.fit_report())
    
    # plot results, including individual components
    comps = result.eval_components(result.params, x=x)
    
    plt.plot(x, y,'r.', label='data')
    plt.plot(x, result.best_fit, 'k-', label='fit')
    plt.plot(x, comps['g1_'], 'b--', label='g1_')
    plt.plot(x, comps['g2_'], 'g--', label='g2_')
    plt.legend()
    plt.show()
    

    This prints out a report of

    [[Model]]
        (Model(gaussian_cdf, prefix='g1_') + Model(gaussian_cdf, prefix='g2_'))
    [[Fit Statistics]]
        # fitting method   = leastsq
        # function evals   = 66
        # data points      = 50
        # variables        = 6
        chi-square         = 0.00626332
        reduced chi-square = 1.4235e-04
        Akaike info crit   = -437.253376
        Bayesian info crit = -425.781238
    [[Variables]]
        g1_amp:    0.65818908 +/- 0.00851338 (1.29%) (init = 0.5)
        g1_mu:     93.8438526 +/- 0.01623273 (0.02%) (init = 94)
        g1_sigma:  0.54362156 +/- 0.02021614 (3.72%) (init = 1)
        g2_amp:    0.34058664 +/- 0.01153346 (3.39%) (init = 0.5)
        g2_mu:     97.7056728 +/- 0.06408910 (0.07%) (init = 98)
        g2_sigma:  1.24891832 +/- 0.09204020 (7.37%) (init = 1)
    [[Correlations]] (unreported correlations are < 0.100)
        C(g1_amp, g2_amp)     = -0.892
        C(g2_amp, g2_sigma)   =  0.848
        C(g1_amp, g2_sigma)   = -0.744
        C(g1_amp, g1_mu)      =  0.692
        C(g1_amp, g2_mu)      =  0.662
        C(g1_mu, g2_amp)      = -0.607
        C(g1_amp, g1_sigma)   =  0.571
    

    and a plot like this:

    enter image description here This fit is not perfect, but it should get you started.