python data-analysis curve-fitting scipy-optimize

Gaussian fit with gap in data

I would like to fit a gaussian on an absorption band, in a reflectance spectra. Problem is, in this part of the spectra I have a gap in my data (blank space on the figure), between approximately 0.85 and 1.3 µm. I really need this gaussian, because after this I would like to compute spectral parameters such as the Full-With Half-Maximum and the area occupied by the gaussian.

Here is the functions I use to perform the gaussian fit (credit to this post) :

def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

def gauss_fit(x, y):
    mean = sum(x * y) / sum(y)
    sigma = np.sqrt(sum(y * (x - mean) ** 2) / sum(y))
    popt, pcov = scipy.optimize.curve_fit(gauss, x, y, p0=[min(y), max(y), mean, sigma])
    return popt

How I apply it :

H, A, x0, sigma = gauss_fit(x, y)

And plot it :

plt.plot(x, y, '.k', label='data')
plt.plot(x, gauss(x, *gauss_fit(x, y)), '-r', label='fit')
plt.xlabel('Wavelength (µm)')
plt.ylabel('Reflectance')
plt.legend()

Here is what I get : Gaussian fit with gap in data

As you can see the fit seems to work fine until it reaches the gap.

It would really help me if I could have a clean gaussian fit, bonus points if I can then easily compute the FMWH and the area of the gaussian afterward.

Please note that the solution should not be too case-specific, as I have huge datasets to work on. It should thus be implementable in loops.

The only post I could find talking about this issue is this one, but it did not provide me with a satisfying solution.

It's my first time posting on Stack Overflow, please feel free to ask for any complementary information if needed.

Edit 1

I think Tino D and lastchance answers solved the main problem of my code, it being that I defined a normal gaussian fit instead of a substracted one. I obtain this new fit.

However as you can see it is still not perfect, and it strangely goes beyond y=1 on the right side of the fit. I think the problem now resides in my data itself, so here it is as requested.

I use pandas to manage my data, so the file is in pickle format as I store entire nupy arrays in unique DataFrame cells. For this fit we only need the columns 'Wavelength' and 'Reflectance Normalized'. We also only need part of the spectra, so here is the code I use to sample what I need :

#df_merged is the name of the DataFrame

R0700=FindNearest(df_merged.at[0,'Wavelength'], df_merged.at[0,'Reflectance Normalized'], 0.7, show=True)
R1800=FindNearest(df_merged.at[0,'Wavelength'], df_merged.at[0,'Reflectance Normalized'], 1.8, show=True)

x=df_merged.at[0,'Wavelength'][R0700[1]:R1800[1]]
y=df_merged.at[0,'Reflectance Normalized'][R0700[1]:R1800[1]]

FindNearest is a function of mine I use to find specific values in my arrays, it is defined as the following :

def FindNearest(wl, rf, x, show=False):      # wl = wavelength array ; rf = reflectance array ; x = value to find
    
    wl_dif=np.abs(wl-x)
    idx=np.argmin(wl_dif)
    
# Get reflectance and wavelength values
    
    wl_x=wl[idx]
    rf_x=rf[idx]
    
    if show:
        print("Nearest wavelength :", wl_x, ", Index :", idx, ", Corresponding reflectance :", rf_x)

    
# Values : [0] = nearest wavelength value, [1] = position of nearest value in array 
# (both reflectance and wavelength since same size), [2] = nearest reflectance value
    
    return(wl_x, idx, rf_x)

We almost got this, thanks a lot to you 2 !

Solution

As lastchance said and as my comment suggested, you were fitting the wrong curve (or your initial guess was not compatible). The function that you wrote was a normal Gaussian and not a subtracted one. So the following can be used

import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def gauss(x, H, A, x0, sigma):
    return H - A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
def gauss_fit(x, y):
    mean = sum(x * y) / sum(y)
    sigma = np.sqrt(sum(y * (x - mean) ** 2) / sum(y))
    popt, pcov = curve_fit(gauss, x, y, p0=[min(y), max(y), mean, sigma])
    return popt
# generate toydataset
x = np.linspace(0.5, 2, 100)
H, A, x0, sigma = 1, 0.5, 1.2, 0.3
y = gauss(x, H, A, x0, sigma) + 0.005 * np.random.normal(size=x.size)
# enerate gap in data
gapStart = int(0.1 * len(x))
gapEnd = int(0.4 * len(x))
xGap = np.concatenate([x[:gapStart], x[gapEnd:]])
yGap = np.concatenate([y[:gapStart], y[gapEnd:]])
popt = gauss_fit(xGap, yGap)
plt.figure()
plt.scatter(xGap, yGap, label='Gap in data', color='blue')
plt.plot(x, gauss(x, *popt), label='Fit', color='red', linestyle='--')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title("30% missing")
plt.grid()

This results in the following:

Now I did generate the toydata but the idea remains the same. As for the calculations:

H_fit, A_fit, x0_fit, sigma_fit = popt
'''
full width at half maximum taken from here and area calculation here:
https://www.physicsforums.com/threads/area-under-gaussian-peak-by-easy-measurements.419285/
'''
FWHM = 2.35 * sigma_fit 
Area = H_fit*FWHM/(2.35*0.3989)
print(f"FWHM = {FWHM}")
print(f"Area = {Area}")

Results:

FWHM = 0.7030608784583746
Area = 0.7486638847495126

Hope this helps!

Edit 1: with data from OP

The badly fitted line that you showed in your example is because you are not passing those x values in the middle.

Alternatively:

popt = gauss_fit(x, y)
plt.figure()
plt.scatter(x, y, label='Gap in data', color='blue')
plt.plot(np.linspace(min(x), max(x), 100), # resample the x variable
         gauss(np.linspace(min(x), max(x), 100), *popt), # and calculate
         label='Fit', color='red', linestyle='--')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid()

We can resample the x variable from min to max with 100 points using np.linspace, and also important to pass it to the gaus functions. Results:

I highly recommend to ditch Gaussian or to hard-code H=1 as lastchance suggested. The data on the top right is too different. OR you say that you don't care about these points and that you can cut them out.

Edit 2: setting the upper limit to 1

As you saw from your calculations, the fit went overboard and set the sigma as a minus value (hence the negative values). For the fit it is not important since sigma is raised to the power of 2. In any case, I suggest the followign adjustments to your code, although I still think a Gaussian fit does not make sense:

def gauss(x, A, x0, sigma):
    return 1 - A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
def gauss_fit(x, y):
    mean = sum(x * y) / sum(y)
    sigma = np.sqrt(sum(y * (x - mean) ** 2) / sum(y))
    popt, pcov = curve_fit(gauss, x, y)
    return popt

In the above functions, H is being set to 1. There is also no need for the initial guesses, the optimizer finds values automatically.

The code for the plotting stays the same. Results:

This gives the following:

FWHM = 0.5
Area = 0.54

Now we see that the fit prefers sticking with the edges to minimize the error and neglects a good fit on the tail. Which btw is too wide of a tail for a Gaussian fit to begin with (in my eyes). The distribution looks slightly platykurtic. Which we can also confirm by calculating the kurtosis:

mean = np.mean(y)
std = np.std(y, ddof=1)
n = len(y)
kurtosis = (n * (n + 1) * np.sum((y - mean) ** 4) / ((n - 1) * (n - 2) * (n - 3) * std ** 4)) - (3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
print(kurtosis) # -0.13344688172380037

-0.13 is not that far away from 0, but still I think the fit can be done in a better way.

Reading material: Link

I suggest you open another question on crossvalidated to ask for a recommendation regarding which curve to fit.