Tags: python, numpy, scipy, curve-fitting

scipy curve_fit doesn't vary some parameters depending on initial values


I'm just fitting a linear function to some data, but ran into issues with curve_fit's default initial values, which are all set to 1. I'm therefore looking to change the initial values, but the code has to stay fairly general as I want to apply it to different y variables. So I set the initial intercept to the average of the first 20 data points, as this should be pretty close to the optimal answer (my x values are offset to start near 0). But I set the initial slope to 0, as for different variables the relationship could have either sign, or none.

However, with the slope initialised at zero in my example, curve_fit doesn't vary the slope at all - instead it changes the intercept to best fit the data, pulling it away from where it should really be. This behaviour doesn't change when the initial value is made non-zero but still tiny - in the example, 0.01 and 0.001 have the same effect. Whenever it fails to vary the parameter, it also raises a 'Covariance of the parameters could not be estimated' warning.

The data are all float64, so it's not directly related to this previously discussed issue here.

I'm not sure if this is the best way to share, but I've uploaded the data to Dropbox here. The code below, along with this data, should reproduce the issue. My scipy version is 1.11.3.

Grateful for any ideas on why this might be happening!

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import pickle

def fit1(x, alpha, beta_t):
    yhat = alpha + beta_t*x
    return yhat


with open('example_data.pkl', 'rb') as handle:
    xs, ys = pickle.load(handle)
    
plt.plot(xs, ys, color='blue')

for guess in [0, 0.001, 0.01, 0.1, 1, 10]:
    init_guess = np.zeros(2)
    init_guess[0] = np.mean(ys[:20])  # intercept: mean of the first 20 points
    init_guess[1] = guess             # slope: the value under test
    params, _ = curve_fit(fit1, xs, ys, p0=init_guess)

    plt.plot(xs, fit1(xs, *params), label=f'init_guess[1] = {init_guess[1]}')


plt.legend()

(plot: the data with the fitted lines for each initial slope guess)


Solution

  • Expanding on the ideas mentioned in the comments:

    If you scale the data so that most values are around 1, both problems go away: the 'Covariance of the parameters could not be estimated' warning disappears, and the linear fit converges no matter what the initial guess is.

        params, _ = curve_fit(fit1, xs, ys/1e9, p0=init_guess)
        plt.plot(xs, fit1(xs, *params) * 1e9, label=f'init_guess[1] = {init_guess[1]}')
    

    scaling result
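The effect is easy to reproduce without the Dropbox data. The sketch below uses synthetic y values of magnitude ~1e9 (an assumption about the original dataset, which isn't available here) and shows that, once y and the intercept guess are divided by the same constant, every initial slope guess recovers essentially the same line:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit1(x, alpha, beta_t):
    return alpha + beta_t * x

# Synthetic stand-in for the original pickle (assumed magnitudes: ys ~ 1e9)
rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 200)
ys = 2.0e9 + 3.0e7 * xs + rng.normal(0.0, 1.0e6, xs.size)

scale = 1e9
results = []
for guess in [0, 0.001, 0.01, 0.1, 1, 10]:
    # Scale y (and the intercept guess) so the fitted quantities are O(1)
    p0 = [np.mean(ys[:20]) / scale, guess]
    params, _ = curve_fit(fit1, xs, ys / scale, p0=p0)
    # Undo the scaling to get parameters in the original units
    alpha, beta = params[0] * scale, params[1] * scale
    results.append((alpha, beta))
```

All six fits land on roughly the same intercept and slope, unlike the unscaled version where the zero-slope guess gets stuck.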

    You could also use StandardScaler rather than multiplying and dividing by a constant.
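For a single column, StandardScaler just subtracts the mean and divides by the standard deviation, so the same effect is easy to sketch with plain NumPy (again on synthetic stand-in data, since the original isn't available); the fitted parameters can then be mapped back to the original units:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit1(x, alpha, beta_t):
    return alpha + beta_t * x

# Synthetic stand-in data (assumed magnitudes: ys ~ 1e9)
rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 200)
ys = 2.0e9 + 3.0e7 * xs + rng.normal(0.0, 1.0e6, xs.size)

# What StandardScaler does to one column: zero mean, unit variance
mu, sigma = ys.mean(), ys.std()
ys_std = (ys - mu) / sigma

# On standardized data, even an all-zero initial guess converges
params, _ = curve_fit(fit1, xs, ys_std, p0=[0.0, 0.0])

# Undo the standardization: y = y_std * sigma + mu
beta = params[1] * sigma
alpha = params[0] * sigma + mu
```

Standardizing has the same benefit as dividing by 1e9: the residuals and parameters the solver sees are all O(1), so the finite-difference Jacobian is well behaved.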

    Alternately, if you use scipy.stats.linregress(), it can find the solution directly, without an initial guess or scaling:

    import scipy.stats

    res = scipy.stats.linregress(xs, ys)
    plt.plot(xs, ys, color='blue')
    plt.plot(xs, res.intercept + res.slope*xs)
    

    linregress result

    (Note: this is after I sorted both xs and ys by xs.)

    linregress() is also about 20x faster than curve_fit() on this dataset: it solves the linear least-squares problem directly in closed form, while curve_fit() runs an iterative solver because it has to handle arbitrary nonlinear models.
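On synthetic data of the same assumed magnitude, the two approaches agree closely once curve_fit() is given scaled inputs, which is a quick sanity check that the scaling workaround isn't changing the answer:

```python
import numpy as np
from scipy.stats import linregress
from scipy.optimize import curve_fit

def fit1(x, alpha, beta_t):
    return alpha + beta_t * x

# Synthetic stand-in data (assumed magnitudes: ys ~ 1e9)
rng = np.random.default_rng(1)
xs = np.linspace(0, 10, 200)
ys = 2.0e9 + 3.0e7 * xs + rng.normal(0.0, 1.0e6, xs.size)

# Closed-form fit: no initial guess or scaling needed
res = linregress(xs, ys)

# Iterative fit on scaled data, then undo the scaling
params, _ = curve_fit(fit1, xs, ys / 1e9, p0=[2.0, 0.0])
alpha, beta = params[0] * 1e9, params[1] * 1e9

# Both should recover essentially the same slope and intercept
```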