I am trying to fit empirical CDF plot to two Gaussian cdf as it seems that it has two peaks, but it does not work. I fit the curve with leastsq from scipy.optimize and erf function from scipy.special. The fitting only gives constant line at a value of 2. I am not sure in which part of the code that I make mistake. Any pointers will be helpful. Thanks!
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.array([ 90.64115156, 90.85690063, 91.07264971, 91.28839878,
91.50414786, 91.71989693, 91.93564601, 92.15139508,
92.36714415, 92.58289323, 92.7986423 , 93.01439138,
93.23014045, 93.44588953, 93.6616386 , 93.87738768,
94.09313675, 94.30888582, 94.5246349 , 94.74038397,
94.95613305, 95.17188212, 95.3876312 , 95.60338027,
95.81912935, 96.03487842, 96.2506275 , 96.46637657,
96.68212564, 96.89787472, 97.11362379, 97.32937287,
97.54512194, 97.76087102, 97.97662009, 98.19236917,
98.40811824, 98.62386731, 98.83961639, 99.05536546,
99.27111454, 99.48686361, 99.70261269, 99.91836176,
100.13411084, 100.34985991, 100.56560899, 100.78135806,
100.99710713, 101.21285621])
y = np.array([3.33333333e-04, 3.33333333e-04, 3.33333333e-04, 1.00000000e-03,
1.33333333e-03, 3.33333333e-03, 6.66666667e-03, 1.30000000e-02,
2.36666667e-02, 3.40000000e-02, 5.13333333e-02, 7.36666667e-02,
1.01666667e-01, 1.38666667e-01, 2.14000000e-01, 3.31000000e-01,
4.49666667e-01, 5.50000000e-01, 6.09000000e-01, 6.36000000e-01,
6.47000000e-01, 6.54666667e-01, 6.61000000e-01, 6.67000000e-01,
6.76333333e-01, 6.84000000e-01, 6.95666667e-01, 7.10000000e-01,
7.27666667e-01, 7.50666667e-01, 7.75333333e-01, 7.93333333e-01,
8.11333333e-01, 8.31333333e-01, 8.56333333e-01, 8.81333333e-01,
9.00666667e-01, 9.22666667e-01, 9.37666667e-01, 9.47333333e-01,
9.59000000e-01, 9.70333333e-01, 9.77333333e-01, 9.83333333e-01,
9.90333333e-01, 9.93666667e-01, 9.96333333e-01, 9.99000000e-01,
9.99666667e-01, 1.00000000e+00])
plt.plot(a,b,'r.')
# Fitting with 2 Gaussian
from scipy.special import erf
from scipy.optimize import leastsq
def two_gaussian_cdf(params, x):
(mu1, sigma1, mu2, sigma2) = params
model = 0.5*(1 + erf( (x-mu1)/(sigma1*np.sqrt(2)) )) +\
0.5*(1 + erf( (x-mu2)/(sigma2*np.sqrt(2)) ))
return model
def residual_two_gaussian_cdf(params, x, y):
model = double_gaussian(params, x)
return model - y
params = [5.,2.,1.,2.]
out = leastsq(residual_two_gaussian_cdf,params,args=(x,y))
double_gaussian(out[0],x)
plt.plot(x,two_gaussian_cdf(out[0],x))
which return to this plot
You may find lmfit
(see http://lmfit.github.io/lmfit-py/) to be a useful alternative to leastsq
here as it provides a higher-level interface to optimization and curve fitting (though still based on scipy.optimize.leastsq
). With lmfit, your example might look like this (cutting out the definition of x
and y
data):
#!/usr/bin/env python
import numpy as np
from scipy.special import erf
import matplotlib.pyplot as plt
from lmfit import Model
# define the basic model. I included an amplitude parameter
def gaussian_cdf(x, amp, mu, sigma):
return (amp/2.0)*(1 + erf( (x-mu)/(sigma*np.sqrt(2))))
# create a model that is the sum of two gaussian_cdfs
# note that a prefix names each component and will be
# applied to the parameter names for each model component
model = Model(gaussian_cdf, prefix='g1_') + Model(gaussian_cdf, prefix='g2_')
# make a parameters object -- a dict with parameter names
# taken from the arguments of your model function and prefix
params = model.make_params(g1_amp=0.50, g1_mu=94, g1_sigma=1,
g2_amp=0.50, g2_mu=98, g2_sigma=1.)
# you can apply bounds to any parameter
#params['g1_sigma'].min = 0 # sigma must be > 0!
# you may want to fix the amplitudes to 0.5:
#params['g1_amp'].vary = False
#params['g2_amp'].vary = False
# run the fit
result = model.fit(y, params, x=x)
# print results
print(result.fit_report())
# plot results, including individual components
comps = result.eval_components(result.params, x=x)
plt.plot(x, y,'r.', label='data')
plt.plot(x, result.best_fit, 'k-', label='fit')
plt.plot(x, comps['g1_'], 'b--', label='g1_')
plt.plot(x, comps['g2_'], 'g--', label='g2_')
plt.legend()
plt.show()
This prints out a report of
[[Model]]
(Model(gaussian_cdf, prefix='g1_') + Model(gaussian_cdf, prefix='g2_'))
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 66
# data points = 50
# variables = 6
chi-square = 0.00626332
reduced chi-square = 1.4235e-04
Akaike info crit = -437.253376
Bayesian info crit = -425.781238
[[Variables]]
g1_amp: 0.65818908 +/- 0.00851338 (1.29%) (init = 0.5)
g1_mu: 93.8438526 +/- 0.01623273 (0.02%) (init = 94)
g1_sigma: 0.54362156 +/- 0.02021614 (3.72%) (init = 1)
g2_amp: 0.34058664 +/- 0.01153346 (3.39%) (init = 0.5)
g2_mu: 97.7056728 +/- 0.06408910 (0.07%) (init = 98)
g2_sigma: 1.24891832 +/- 0.09204020 (7.37%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
C(g1_amp, g2_amp) = -0.892
C(g2_amp, g2_sigma) = 0.848
C(g1_amp, g2_sigma) = -0.744
C(g1_amp, g1_mu) = 0.692
C(g1_amp, g2_mu) = 0.662
C(g1_mu, g2_amp) = -0.607
C(g1_amp, g1_sigma) = 0.571
and a plot like this: