I want to know which function fits my data better by calculating the chi-square value for each fit and then comparing them.
The data I have could look either like a Gaussian or like a horizontal line, so my goal is to fit my data to both functions each time (using scipy.optimize.curve_fit) and then calculate the chi-square value (using scipy.stats). But can I directly compare both values and take the lowest one? Both fits use the same number of points but a different number of parameters (hence not the same degrees of freedom): 4 parameters for my Gaussian and 1 for my horizontal line.
def horiz(x, c):
    return np.full_like(x, c)

# Define the Gaussian function
def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
An example of my code is below:
p0_c = [moy]
popt_c, pcov = curve_fit(horiz, x, Smooth_data, p0=p0_c)
fit_c = horiz(x, *popt_c)
chi_c, p_c = stats.chisquare(Counts_list[i], fit_c)
You could use something like the Bayesian Information Criterion (BIC) for model selection, which accounts for the different number of parameters in each model by penalising models with more parameters.
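For reference, the BIC of a model with k free parameters fitted to n data points is BIC = k ln(n) - 2 ln(Lmax), where Lmax is the maximised likelihood. As a minimal sketch (the helper name bic here is just for illustration):

import numpy as np

def bic(log_likelihood, n_params, n_data):
    # BIC = k * ln(n) - 2 * ln(L); lower values indicate the preferred model
    return n_params * np.log(n_data) - 2 * log_likelihood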
Mocking up a version of your example, this could be done with the following steps.
Set up model functions and generate some simulated data containing a Gaussian "bump".
import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit

def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

def horiz(x, c):
    return np.full_like(x, c)
# generate some data containing a Gaussian bump
n = 10_000 # number of data points
x = np.linspace(0, 1_000_000, n, dtype=float)
# random Gaussian noise
y = np.random.randn(n) * 5e-8
# Gaussian bump parameters
sigma = 30_000
x0 = 400_000
H = 0
A = 4.5e-7
y += gauss(x, H, A, x0, sigma)
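Optionally, you can eyeball the simulated data, reusing the matplotlib import from above:

# quick sanity-check plot of the simulated data
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Simulated data with a Gaussian bump")
plt.show()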
Use curve_fit to get a best-fit, maximum-likelihood model (I've had to scale the parameters to get the fitting to work reasonably).
# scale factors
scaleamp = 1e8
scalerange = 1e-5
# initial parameter guesses
iniguess = (0.1e-7 * scaleamp, 1e-7 * scaleamp, 350000.0 * scalerange, 10000.0 * scalerange)
# fit Gaussian bump model
gaussfit = curve_fit(gauss, x * scalerange, y * scaleamp, p0=iniguess)
# fit horizontal line model
linefit = curve_fit(horiz, x, y * scaleamp, p0=(0.0,))
Generate the best-fit models and calculate the log-likelihood of each model given the data. Because we don't have an estimate of the uncertainty on the data points, we can use for the likelihood the Student's t-like distribution, marginalised over the unknown noise level, defined in Eqn. 3.38 of Sivia on pg. 53.
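Explicitly, up to an additive normalisation constant, that marginalised log-likelihood for N data points y_i and model predictions f_i is

$$\ln L = -\frac{N - 1}{2} \ln\left[\sum_{i=1}^{N} (y_i - f_i)^2\right],$$

which is what the code below computes.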
Hf, Af, x0f, sigmaf = gaussfit[0]
# rescale best fit parameters to generate the Gaussian model
bestgaussfit = gauss(x, Hf / scaleamp, Af / scaleamp, x0f / scalerange, sigmaf / scalerange)
# get log likelihood
logLgauss = -((n - 1) / 2) * np.log(np.sum((y - bestgaussfit)**2))
# generate the best fit horizontal line model
bestlinefit = horiz(x, linefit[0][0] / scaleamp)
# get log likelihood
logLhoriz = -((n - 1) / 2) * np.log(np.sum((y - bestlinefit)**2))
Generate the BIC for both models.
ngaussparams = 4 # number of Gaussian model parameters
bicgauss = ngaussparams * np.log(n) - 2 * logLgauss
nlineparams = 1 # number of parameters for the flat line
bicline = nlineparams * np.log(n) - 2 * logLhoriz
With the BIC, the model with the lower value is the preferred model. In this case the BIC for the Gaussian model is much lower (by about 16,000 for the data I generated).
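If you want that final comparison in code (a minimal sketch using the variables defined above):

# a positive difference favours the Gaussian bump model
delta_bic = bicline - bicgauss
print(f"BIC (Gaussian bump):   {bicgauss:.1f}")
print(f"BIC (horizontal line): {bicline:.1f}")
print("Preferred model:", "Gaussian" if bicgauss < bicline else "horizontal line")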