Search code examples
pythoncurve-fittinggaussiandata-fitting

Gaussian data fit varying depending on position of x data


I am having a hard time trying to understand why my Gaussian fit to a set of data (ydata) does not work well if I shift the interval of x-values corresponding to that data (xdata1 to xdata2). The Gaussian is written as:

where A is just an amplitude factor. Changing some of the values of the data, it is easy to make it work for both cases, but one can also easily find cases in which it does not work well for xdata1 and also in which covariance of the parameters is not estimated. I am using scipy.optimize.curve_fit in Spyder with Python 3.7.1 on Windows 7.

results from code

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

xdata1 = np.linspace(-9,4,20, endpoint=True) # works fine
xdata2 = xdata1+2
ydata = np.array([8,9,15,12,14,20,24,40,54,94,160,290,400,420,300,130,40,10,8,4])

def gaussian(x, amp, mean, sigma):
    return amp*np.exp(-(((x-mean)**2)/(2*sigma**2)))/(sigma*np.sqrt(2*np.pi))

popt1, pcov1 = curve_fit(gaussian, xdata1, ydata)
popt2, pcov2 = curve_fit(gaussian, xdata2, ydata)

fig, ([ax1, ax2]) = plt.subplots(nrows=1, ncols=2,figsize=(9, 4))

ax1.plot(xdata1, ydata, 'b+:', label='xdata1')
ax1.plot(xdata1, gaussian(xdata1, *popt1), 'r-', label='fit')
ax1.legend()
ax2.plot(xdata2, ydata, 'b+:', label='xdata2')
ax2.plot(xdata2, gaussian(xdata2, *popt2), 'r-', label='fit')
ax2.legend()

Solution

  • The problem is your second attempt at fitting a gaussian is getting stuck in a local minimum while searching parameter space: curve_fit is a wrapper for least_squares which uses gradient descent to minimize the cost function and this is liable to get stuck in local minima.

    You should try providing reasonable starting parameters (by using the p0 argument of curve_fit) to avoid this:

     #...  your code
    
     y_max = np.max(y_data)
     max_pos = ydata[ydata==y_max][0]
     initial_guess = [y_max, max_pos, 1] # amplitude, mean, std
    
     popt2, pcov2 = curve_fit(gaussian, xdata2, ydata, p0=initial_guess)
    

    Which as you can see provides a reasonable fit:

    gaussian_fits_with_parameters

    You should write a function which can provide reasonable estimates of the starting parameters. Here I just found the maximum y value and used this to determine the initial parameters. I've found this works well for the fitting normal distributions but you could consider other methods.

    Edit:

    You can also solve the problem by scaling the amplitude: the amplitude is so large the parameter space is distorted and the gradient descent simply follows the direction of greatest change in the amplitude and effectively ignores the sigma. Consider the following plot in parameter space (Colour is the sum of the squared residuals of the fit for given parameters and the white cross shows the optimal solution):

    full_parameter_space

    Make sure to make note of the different scales for the x and y axis.

    One needs to make a large number of 'unit' sized steps in y (amplitude) to get to the minimum from the point x,y = (0,0), where as you only need less than one 'unit' sized step to get to the minimum in x (sigma). The algorithm simply takes steps in amplitude as this is the steepest gradient. When it gets to the amplitude which minimises the cost function it simply stops the algorithm as it appears to have converged and makes little or no changes in the sigma parameter.

    One way to fix this is to scale your ydata to un-distort the parameter space: divide your ydata by 100 and you will see your fit works without providing any starting parameters!