python, numpy, scipy, mathematical-optimization

Fitting a distribution to data: how to penalize "bad" parameter estimates?


I'm using scipy's least-squares optimization to fit an exponentially modified Gaussian distribution to a set of reaction time measurements. In general it works well, but sometimes the optimization goes off the rails and chooses a crazy value for a parameter -- the resulting plot clearly doesn't fit the data very well. The problems seem to arise from floating-point precision errors -- we head off into 0, inf, or nan-land.

I'm thinking of doing two things:

  • Using the parameters to simultaneously fit a CDF and PDF to the data, roughly as sketched after this list; I have formulas for both. (I'm using a kernel density estimate to approximate the PDF from the data.)
  • Somehow taking into account the distance from the initial parameter estimates (generated by the method of moments approach on the wikipedia page). Those estimates are far from perfect, but are pretty good and seem to steer clear of "exploding floating point" problems.
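
To make the first idea concrete, here's roughly what I have in mind -- just a sketch, where exgauss_pdf and exgauss_cdf stand in for my Ex-Gaussian formulas, and the targets are a Gaussian KDE of the data and the empirical CDF:

import numpy as np
from scipy import optimize, stats

def combined_residuals(params, data, grid):
    # residuals of the model PDF against a kernel density estimate of the data...
    kde = stats.gaussian_kde(data)
    pdf_resid = exgauss_pdf(params, grid) - kde(grid)   # exgauss_pdf: my PDF formula (not shown)
    # ...stacked with residuals of the model CDF against the empirical CDF
    ecdf = np.searchsorted(np.sort(data), grid, side='right') / float(len(data))
    cdf_resid = exgauss_cdf(params, grid) - ecdf        # exgauss_cdf: my CDF formula (not shown)
    return np.concatenate([pdf_resid, cdf_resid])

# fit, _ = optimize.leastsq(combined_residuals, p0, args=(data, grid))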

Combining the PDF and CDF fits sounds pretty straightforward; the scales of the errors will even be roughly the same. But working the initial parameter estimates into the objective: I'm not sure it's even a good idea -- but if it is:

  • What would I do about the difference in scale? Should I normalize the parameter "error" to a percent error?
  • Is there a reasonable way to decide on a relative weight between the data estimation error and parameter "error"?

Are these even the right questions to be asking? Are there generally-regarded "correct" answers, or is "try some stuff until you find something that seems to work" a good approach?

One example dataset

As requested, here's a dataset for which this process isn't working very well. I know there are only a few samples and that the data don't fit the distribution well; I'm still hoping against hope that I can get a "reasonable-looking" result from optimization.

array([ 450.,  560.,  692.,  730.,  758.,  723.,  486.,  596.,  716.,
        695.,  757.,  522.,  535.,  419.,  478.,  666.,  637.,  569.,
        859.,  883.,  551.,  652.,  378.,  801.,  718.,  479.,  544.])

MLE Update

I had a bunch of problems getting my maximum-likelihood fit to converge to a "reasonable" value, until I discovered this: if X contains at least one nan, np.sum(X) == nan when X is a numpy array, but not when X is a pandas Series. So the summed log-likelihood was doing stupid things when the parameters started to go out of bounds.

Added an np.asarray() call and everything is great!
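
A minimal illustration of the behavior that bit me (at least with the numpy/pandas versions I have installed):

import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, np.nan])
print(np.sum(x))                          # nan -- NaN propagates through the ndarray sum
print(np.sum(pd.Series(x)))               # 3.0 -- dispatches to Series.sum(), which skips NaN by default
print(np.sum(np.asarray(pd.Series(x))))   # nan again once it's back to a plain ndarray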


Solution

  • This should have been a comment, but I ran out of space.

    I think a maximum likelihood fit is probably the most appropriate approach here. The ML method is already implemented for many distributions in scipy.stats. For example, you can find the MLE of the normal distribution by calling scipy.stats.norm.fit, and the MLE of the exponential distribution in a similar way. Combining the two sets of MLE parameters should give you a pretty good starting vector for the Ex-Gaussian ML fit. In fact, I would imagine most of your data is quite nicely normally distributed; if that is the case, the ML parameter estimates for the normal distribution alone should give you a pretty good starting point.
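
    For instance, with the reaction times in a 1-D array data, one way you might combine the two fits is something like this (just a sketch; the exponential MLE's scale parameter estimates 1/lambda, so its reciprocal can serve as a rough starting rate):

    import scipy.stats as sss
    mu0, sigma0 = sss.norm.fit(data)    # normal MLE: location (mu) and scale (sigma)
    loc0, scale0 = sss.expon.fit(data)  # exponential MLE: loc and scale (scale estimates 1/lambda)
    p0 = [mu0, sigma0, 1.0 / scale0]    # rough starting vector (mu, sigma, lambda) for the Ex-Gaussian fit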

    Since the Ex-Gaussian only has 3 parameters, I don't think an ML fit will be hard at all. If you could provide a dataset for which your current method doesn't work well, it would be easier to show a real example.

    Alright, here you go:

    >>> import scipy.special as sse
    >>> import scipy.stats as sss
    >>> import scipy.optimize as so
    >>> from numpy import *
    
    >>> def eg_pdf(p, x): #Ex-Gaussian PDF; p = (mu, sigma, lambda)
        m=p[0]
        s=p[1]
        l=p[2]
        return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
    
    >>> xo=array([ 450.,  560.,  692.,  730.,  758.,  723.,  486.,  596.,  716.,
            695.,  757.,  522.,  535.,  419.,  478.,  666.,  637.,  569.,
            859.,  883.,  551.,  652.,  378.,  801.,  718.,  479.,  544.])
    
    >>> sss.norm.fit(xo) #get the starting parameter vector from the normal MLE
    (624.22222222222217, 132.23977474531389)
    
    >>> def llh(p, f, x): #defines the negative log-likelihood function
        return -sum(log(f(p,x)))
    
    >>> so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo)) #yeah, the data is not good
    Warning: Maximum number of function evaluations has been exceeded.
    array([  6.14003407e+02,   1.31843250e+02,   9.79425845e-02])
    
    >>> przt=so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo), maxfun=1000) #so we raise the upper limit on function evaluations
    Optimization terminated successfully.
             Current function value: 170.195924
             Iterations: 376
             Function evaluations: 681
    
    >>> llh(array([624.22222222222217, 132.23977474531389, 1e-6]), eg_pdf, xo)
    400.02921290185645
    >>> llh(przt, eg_pdf, xo) #quite an improvement over the initial guess
    170.19592431051217
    >>> przt
    array([  6.14007039e+02,   1.31844654e+02,   9.78934519e-02])
    

    The optimizer used here (fmin, the Nelder-Mead simplex algorithm) does not use any gradient information and usually converges much more slowly than optimizers that do. It appears that the derivative of the negative log-likelihood of the Exponential Gaussian can be written in closed form fairly easily. If so, optimizers that use the gradient/derivative (such as fmin_bfgs) would be a better and more efficient choice.
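
    For example, one might refine the simplex solution with BFGS along these lines (just a sketch, not an actual run):

    # refine the Nelder-Mead result przt with BFGS; the gradient is approximated
    # numerically unless an analytic gradient function is passed via fprime=
    przt_bfgs = so.fmin_bfgs(llh, przt, args=(eg_pdf, xo))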

    The other thing to consider is parameter constraints. By definition, sigma and lambda have to be positive for the Exponential Gaussian. You can use a constrained optimizer (such as fmin_l_bfgs_b). Alternatively, you can optimize this reparameterized version instead:

    >>> def eg_pdf2(p, x): #same PDF, with sigma and lambda passed on a log scale
        m=p[0]
        s=exp(p[1])
        l=exp(p[2])
        return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
    

    Due to the functional invariance property of MLE, the MLE obtained with this function should be the same as the one for the original eg_pdf. There are other transformations you can use, besides exp(), to map (-inf, +inf) onto (0, +inf).
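
    To sketch both options (not an actual run; the starting values are the normal-fit numbers from above, log-transformed where the reparameterization needs it):

    # option 1: keep eg_pdf but put box constraints on sigma and lambda (numerical gradient)
    p_con, nll, info = so.fmin_l_bfgs_b(
        llh, array([624.222, 132.240, 1e-6]), args=(eg_pdf, xo),
        approx_grad=True, bounds=[(None, None), (1e-9, None), (1e-9, None)])

    # option 2: fit the reparameterized eg_pdf2 without constraints, then map back
    p2 = so.fmin(llh, array([624.222, log(132.240), log(1e-6)]), (eg_pdf2, xo), maxfun=2000)
    mu, sigma, lam = p2[0], exp(p2[1]), exp(p2[2])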

    You can also consider Lagrange multipliers: http://en.wikipedia.org/wiki/Lagrange_multiplier.