Search code examples
pythonplotgaussiandata-fitting

Gaussian fit to histogram on python seems off. What could I change to improve the fit?


I have created a Gaussian fit to data plotted as a bar chart. However, the fit does not look right, and I don't know what to change to improve the fit. My code is as follows:

import matplotlib.pyplot as plt
import math
import numpy as np
from collections import Counter
import collections
from scipy.optimize import curve_fit
from scipy.stats import norm
from scipy import stats
import matplotlib.mlab as mlab

k_list = [-40, -32, -30, -28, -26, -24, -22, -20, -18, -16, -14, -12, -10, -8, -6, -4, -3, -2, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34]
v_list = [1, 2, 11, 18, 65, 122, 291, 584, 1113, 2021, 3335, 5198, 7407, 10043, 12552, 14949, 1, 16599, 16770, 16728, 14772, 12475, 9932, 7186, 4987, 3286, 1950, 1080, 546, 285, 130, 54, 18, 11, 2, 2]

def func(x, A, beta, B, mu, sigma):
    return (A * np.exp(-x/beta) + B * np.exp(-100.0 * (x - mu)**2 / (2 * sigma**2))) #Normal distribution

popt, pcov = curve_fit(func, xdata=k_list, ydata=v_list, p0=[10000, 5, 10000, 10, 10])
print(popt)


x = np.linspace(-50, 50, 1000)

plt.bar(k_list, v_list, label='myPLOT', color = 'b', width = 0.75)
plt.plot(x, func(x, *popt), color='darkorange', linewidth=2.5, label=r'Fitted function')

plt.xlim((-30, 45))
plt.legend()
plt.show()

The plot I obtain is as follows:

Bar plot with fitted Gaussian curve

How can I adjust my fit?


Solution

  • You have a significant outlier here, possibly caused by a typo: (k, v) == (-3, 1) at index 16 in the data.

    enter image description here

    The representation of the data as a bar chart is not optimal here. The issue would be clearly visible if you showed the data in the same format as you show the fit. Either of the following would work better:

    enter image description here

    The outlier forces the peak down. Here is the fit if we remove the outlier manually:

    enter image description here

    You can remove the outlier automatically by checking its individual residual against the RMSE of the entire fit:

    popt, pcov = curve_fit(func, xdata=k_list, ydata=v_list, p0=[10000, 5, 10000, 10, 10])
    resid = np.abs(func(k_list, *popt) - v_list)
    rmse = np.std(resid)
    keep = resid < 3 * rmse
    if keep.sum() < keep.size:
         popt, pcov = curve_fit(func, xdata=k_list[keep], ydata=v_list[keep], p0=popt)
    

    Or even a repeated application:

    popt = [10000, 5, 10000, 10, 10]
    while True:
        popt, pcov = curve_fit(func, xdata=k_list, ydata=v_list, p0=popt)
        resid = np.abs(func(k_list, *popt) - v_list)
        rmse = np.std(resid)
        keep = resid < 5 * rmse
        if keep.sum() == keep.size:
            break
        k_list = k_list[keep]
        v_list = v_list[keep]
    

    A 3-sigma outlier will trim everything off your data after a couple of iterations, so I used 5-sigma. Keep in mind that this is a very quick and dirty way to denoise data. It's really basically manual, since you have to re-check the data to make sure that your choice of factor was correct.