Tags: python, curve-fitting

Python fit and chi-square comparison


I want to know which function fits my data better by calculating the chi-square value for each fit and then comparing them.

The data I have could look either like a Gaussian or roughly like a horizontal line, so my goal is to fit my data to both functions each time (using scipy.optimize.curve_fit) and then calculate the chi-square value for each fit (using scipy.stats). But can I directly compare the two values and take the lowest one? Both fits use the same number of data points but a different number of parameters (hence not the same degrees of freedom): 4 parameters for my Gaussian and 1 for my horizontal line.

[Images: Gaussian-like data, horizontal-like data, test fitting, test fitting 2 using Matt's answer]

import numpy as np

# Define the horizontal-line function
def horiz(x, c):
    return np.full_like(x, c)

# Define the Gaussian function
def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

Example of my code below (fitting the horizontal line):

from scipy import stats
from scipy.optimize import curve_fit

p0_c = [moy]  # moy: initial guess for the constant level
popt_c, pcov = curve_fit(horiz, x, Smooth_data, p0=p0_c)
fit_c = horiz(x, *popt_c)
chi_c, p_c = stats.chisquare(Counts_list[i], fit_c)



Solution

  • You could use something like the Bayesian Information Criterion (BIC) for model selection, which accounts for the different number of parameters in each model by adding a penalty for each extra parameter.
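
    For reference, the BIC is defined as k * ln(n) - 2 * ln(Lmax), where k is the number of model parameters, n is the number of data points, and Lmax is the maximised likelihood; this matches the expressions used in step 4 below. A minimal helper (assuming numpy is imported as np and you already have the log-likelihood in hand) might look like:

      def bic(k, n, log_likelihood):
          # k: number of model parameters, n: number of data points
          return k * np.log(n) - 2 * log_likelihood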

    Mocking up a version of your example, this could be done with the following steps.

    1. Set up model functions and generate some simulated data containing a Gaussian "bump".

      import numpy as np
      from matplotlib import pyplot as plt
      from scipy.optimize import curve_fit
      
      def gauss(x, H, A, x0, sigma):
          return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
      
      def horiz(x,c):
          return np.full_like(x, c)
      
      # generate some data containing a Gaussian bump
      n = 10_000  # number of data points
      x = np.linspace(0, 1_000_000, n, dtype=float)
      
      # random Gaussian noise
      y = np.random.randn(n) * 5e-8
      
      # Gaussian bump parameters
      sigma = 30_000
      x0 = 400_000
      H = 0
      A = 4.5e-7
      
      y += gauss(x, H, A, x0, sigma)
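      
      # (optional) quick visual check of the simulated data, using the
      # matplotlib import above
      plt.plot(x, y)
      plt.show()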
      
    2. Use curve_fit to get a best-fit (maximum likelihood) model. (I've had to scale the parameters to get the fitting to work reasonably.)

      # scale factors
      scaleamp = 1e8
      scalerange = 1e-5
      
      # initial parameter guesses
      iniguess = (0.1e-7 * scaleamp, 1e-7 * scaleamp, 350000.0 * scalerange, 10000.0 * scalerange)
      
      # fit Gaussian bump model
      gaussfit = curve_fit(gauss, x * scalerange, y * scaleamp, p0=iniguess)
      
      # fit horizontal line model
      linefit = curve_fit(horiz, x, y * scaleamp, p0=(0.0,))
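      
      # curve_fit also returns the covariance matrix; the square roots of its
      # diagonal give rough 1-sigma parameter uncertainties (in scaled units)
      gausserr = np.sqrt(np.diag(gaussfit[1]))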
      
    3. Generate the best fit models and calculate the (log) likelihood of each model given the data. Because we don't have an estimate of the uncertainty in the data, the unknown noise level can be marginalised over, which gives the Student's t-like likelihood defined in Eqn. 3.38 of Sivia (pg. 53); up to a model-independent constant, its logarithm is -((n - 1) / 2) * log(sum of squared residuals), as computed below.

      Hf, Af, x0f, sigmaf = gaussfit[0]
      
      # rescale best fit parameters to generate the Gaussian model
      bestgaussfit = gauss(x, Hf / scaleamp, Af / scaleamp, x0f / scalerange, sigmaf / scalerange)
      
      # get log likelihood
      logLgauss = -((n - 1) / 2) * np.log(np.sum((y - bestgaussfit)**2))
      
      # generate the best fit horizontal line model
      bestlinefit = horiz(x, linefit[0][0] / scaleamp)
      
      # get log likelihood
      logLhoriz = -((n - 1) / 2) * np.log(np.sum((y - bestlinefit)**2))
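      
      # the raw log likelihoods can be inspected directly, though on their own
      # they don't account for model complexity (that's what the BIC adds)
      print(f"log L (Gaussian): {logLgauss:.1f}, log L (line): {logLhoriz:.1f}")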
      
    4. Generate the BIC for both models.

      ngaussparams = 4  # number of Gaussian model parameters
      bicgauss = ngaussparams * np.log(n) - 2 * logLgauss
      
      nlineparams = 1  # number of parameters for the flat line
      bicline = nlineparams * np.log(n) - 2 * logLhoriz
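      
      # print both values so the two models can be compared directly
      print(f"BIC (Gaussian bump): {bicgauss:.1f}")
      print(f"BIC (flat line):     {bicline:.1f}")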
      

    With the BIC, the model with the lower value is the preferred one. In this case the BIC for the Gaussian model is far lower (by about 16000 for the data I generated).
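
    If you want a sense of scale for that difference, the BIC difference can be turned into an approximate Bayes factor via the standard exp(deltaBIC / 2) relation (this is an approximation, not an exact result); a sketch using the variables from step 4:

      # approximate log10 Bayes factor in favour of the Gaussian model;
      # exponentiating directly would overflow for a difference this large
      delta_bic = bicline - bicgauss
      log10_bf = delta_bic / (2 * np.log(10))
      print(f"log10 Bayes factor (Gaussian vs line) ~ {log10_bf:.0f}")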