Search code examples
pythonpython-3.xscipystatisticscurve-fitting

How to get confidence intervals from curve_fit


My question involves statistics and python and I am a beginner in both. I am running a simulation, and for each value for the independent variable (X) I produce 1000 values for the dependent variable (Y). What I have done is that I calculated the average of Y for each value of X and fitted these averages using scipy.optimize.curve_fit. The curve fits nicely, but I want to draw also the confidence intervals. I am not sure if what I am doing is correct or if what I want to do can be done, but my question is how can I get the confidence intervals from the covariance matrix produced by curve_fit. The code reads the averages from files first then it just simply uses curve_fit.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


def readTDvsTx(L, B, P, fileformat):
    # L should be '_Fixed_' or '_'
    TD = []
    infile = open(fileformat.format(L, B, P), 'r')
    infile.readline()  # To remove header
    for line in infile:
        l = line.split()  # each line contains TxR followed by CD followed by TD
        if eval(l[0]) >= 70 and eval(l[0]) <=190:
            td = eval(l[2])
            TD.append(td)
    infile.close()
    tdArray = np.array(TD)

    return tdArray


def rec(x, a, b):
    return a * (1 / (x**2)) + b



fileformat = 'Densities_file{}BS{}_PRNTS{}.txt'
txR = np.array(range(70, 200, 20))
parents = np.array(range(1,6))
disc_p1 = readTDvsTx('_Fixed_', 5, 1, fileformat)


popt, pcov = curve_fit(rec, txR, disc_p1)


plt.plot(txR, rec(txR, popt[0], popt[1]), 'r-')
plt.plot(txR, disc_p1, '.')

print(popt)
plt.show()

And here is the resulting fit: enter image description here


Solution

  • Here's a quick and wrong answer: you can approximate the errors from the covariance matrix for your a and b parameters as the square root of its diagonals: np.sqrt(np.diagonal(pcov)). The parameter uncertainties can then be used to draw the confidence intervals.

    The answer is wrong because you before you fit your data to a model, you'll need an estimate of the errors on your averaged disc_p1 points. When averaging, you have lost the information about the scatter of the population, leading curve_fit to believe that the y-points you feed it are absolute and undisputable. This might cause an underestimation of your parameter errors.

    For an estimate of the uncertainties of your averaged Y values, you need to estimate their dispersion measure and pass it along to curve_fit while saying that your errors are absolute. Below is an example of how to do this for a random dataset where each of your points consists of a 1000 samples drawn from a normal distribution.

    from scipy.optimize import curve_fit
    import matplotlib.pylab as plt
    import numpy as np
    
    # model function
    func = lambda x, a, b: a * (1 / (x**2)) + b 
    
    # approximating OP points
    n_ypoints = 7 
    x_data = np.linspace(70, 190, n_ypoints)
    
    # approximating the original scatter in Y-data
    n_nested_points = 1000
    point_errors = 50
    y_data = [func(x, 4e6, -100) + np.random.normal(x, point_errors,
              n_nested_points) for x in x_data]
    
    # averages and dispersion of data
    y_means = np.array(y_data).mean(axis = 1)
    y_spread = np.array(y_data).std(axis = 1)
    
    best_fit_ab, covar = curve_fit(func, x_data, y_means,
                                   sigma = y_spread,
                                   absolute_sigma = True)
    sigma_ab = np.sqrt(np.diagonal(covar))
    
    from uncertainties import ufloat
    a = ufloat(best_fit_ab[0], sigma_ab[0])
    b = ufloat(best_fit_ab[1], sigma_ab[1])
    text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
    print(text_res)
    
    # plotting the unaveraged data
    flier_kwargs = dict(marker = 'o', markerfacecolor = 'silver',
                        markersize = 3, alpha=0.7)
    line_kwargs = dict(color = 'k', linewidth = 1)
    bp = plt.boxplot(y_data, positions = x_data,
                     capprops = line_kwargs,
                     boxprops = line_kwargs,
                     whiskerprops = line_kwargs,
                     medianprops = line_kwargs,
                     flierprops = flier_kwargs,
                     widths = 5,
                     manage_ticks = False)
    # plotting the averaged data with calculated dispersion
    #plt.scatter(x_data, y_means, facecolor = 'silver', alpha = 1)
    #plt.errorbar(x_data, y_means, y_spread, fmt = 'none', ecolor = 'black')
    
    # plotting the model
    hires_x = np.linspace(50, 190, 100)
    plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
    bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
    bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
    # plotting the confidence intervals
    plt.fill_between(hires_x, bound_lower, bound_upper,
                     color = 'black', alpha = 0.15)
    plt.text(140, 800, text_res)
    plt.xlim(40, 200)
    plt.ylim(0, 1000)
    plt.show()
    

    absolutely weighted least squares

    Edit: If you are not considering the intrinsic errors on the data points, you are probably fine with using the "qiuck and wrong" case I mentioned before. The square root of the diagonal entries of covariance matrix can then be used to calculate your confidence intervals. However, note that the confidence intervals have shrunk now that we've dropped the uncertainties:

    from scipy.optimize import curve_fit
    import matplotlib.pylab as plt
    import numpy as np
    
    func = lambda x, a, b: a * (1 / (x**2)) + b
    
    n_ypoints = 7
    x_data = np.linspace(70, 190, n_ypoints)
    
    y_data = np.array([786.31, 487.27, 341.78, 265.49,
                        224.76, 208.04, 200.22])
    best_fit_ab, covar = curve_fit(func, x_data, y_data)
    sigma_ab = np.sqrt(np.diagonal(covar))
    
    # an easy way to properly format parameter errors
    from uncertainties import ufloat
    a = ufloat(best_fit_ab[0], sigma_ab[0])
    b = ufloat(best_fit_ab[1], sigma_ab[1])
    text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
    print(text_res)
    
    plt.scatter(x_data, y_data, facecolor = 'silver',
                edgecolor = 'k', s = 10, alpha = 1)
    
    # plotting the model
    hires_x = np.linspace(50, 200, 100)
    plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
    bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
    bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
    # plotting the confidence intervals
    plt.fill_between(hires_x, bound_lower, bound_upper,
                     color = 'black', alpha = 0.15)
    plt.text(140, 630, text_res)
    plt.xlim(60, 200)
    plt.ylim(0, 800)
    plt.show()
    

    no-sigma-case

    If you're unsure whether to include the absolute errors or how to estimate them in your case, you'd be better off asking for advice at Cross Validated, as Stack Overflow is mainly for discussion on implementations of regression methods and not for discussion on the underlying statistics.