Tags: python, scipy, curve-fitting, scipy-optimize

Instability in fitting data using Scipy Optimize library


I am attempting to fit a function to a set of data I have. The function in question is:

x(t) = - B + sqrt(AB(t-t0) + (x0 + B)^2)

I have tried to fit my data (included at the bottom) to this using two different methods but have found that whatever I do the fit for B is extremely unstable. Changing either the method or the initial guess wildly changes the output value. In addition, when I look at the error for this fit using curve_fit the error is almost two orders of magnitude higher than the value. Does anyone have some suggestions on what I should do to decrease the error?

import numpy as np
import scipy.optimize as spo

def modelFun(t,A,B):
    # x(t) = -B + sqrt(A*B*(t - t0) + (x0 + B)^2); t0 and x0 are module-level globals set below
    return -B + np.sqrt(A*B*(t-t0) + np.power(x0 + B,2))

def errorFun(k,time,data):
    # Sum of squared residuals between the data and the model
    A = k[0]
    B = k[1]
    return np.sum((data-modelFun(time,A,B))**2)
    
data = np.genfromtxt('testdata.csv',delimiter=',',skip_header = 1)
time = data[:,0]
xt = data[:,1]
t0 = data[0,0]
x0 = data[0,1]

minErrOut = spo.minimize(errorFun,[1,1000],args=(time,xt),bounds=((0,None),(0,None)))
(curveOut, curveCovar) = spo.curve_fit(modelFun,time,xt,p0=[1,1000],method='dogbox',bounds=([-np.inf,0],np.inf))

print('minimize result: A={}; B={}'.format(*minErrOut.x))
print('curveFit result: A={}; B={}'.format(*curveOut))
print('curveFit Error: A={}; B={}'.format(*np.sqrt(np.diag(curveCovar))))

Datafile:

Time,x
201,2.67662
204,3.28159
206,3.44378
208,3.72537
210,3.94826
212,4.36716
214,4.65373
216,5.26766
219,5.59502
221,6
223,6.22189
225,6.49652
227,6.799
229,7.30846
231,7.54229
232,7.76517
233,7.6209
234,7.89552
235,7.94826
236,8.17015
237,8.66965
238,8.66965
239,8.8398
240,8.88856
241,9.00697
242,9.45075
243,9.51642
244,9.63483
245,9.63483
246,10.07861
247,10.02687
248,10.24876
249,10.31443
250,10.47164
251,10.99502
252,10.92935
253,11.0995
254,11.28358
255,11.58209
256,11.53035
257,11.62388
258,11.93632
259,11.98806
260,12.26269
261,12.43284
262,12.60299
263,12.801
264,12.99502
265,13.08557
266,13.25572
267,13.32139
268,13.57114
269,13.76617
270,13.88358
271,13.83184
272,14.10647
273,14.27662
274,14.40796

Solution

  • TL;DR

    Your dataset is essentially linear and lacks observations at larger timescales. Therefore the fit can capture A, which is proportional to the slope, while the model needs to keep B large (and potentially unbounded) to suppress the square-root curvature.

    This can be confirmed by expanding your model in a Taylor series and by analyzing the MSE surface associated with the regression.

    In a nutshell, for this kind of dataset and the given model: accept A, don't trust B.

    MCVE

    First, let's reproduce your problem:

    import io
    import numpy as np
    import pandas as pd
    from scipy import optimize
    
    stream = io.StringIO("""Time,x
    201,2.67662
    204,3.28159
    ...
    273,14.27662
    274,14.40796""")
    data = pd.read_csv(stream)
    
    # Origin Shift:
    data = data.sub(data.iloc[0,:])
    data = data.set_index("Time")
    
    # Simplified model:
    def model(t, A, B):
        return -B + np.sqrt(A*B*t + np.power(B, 2))
    
    # NLLS Fit:
    parameters, covariance = optimize.curve_fit(model, data.index, data["x"].values, p0=(1, 1000), ftol=1e-8)
    # array([3.23405915e-01, 1.59960168e+05])
    # array([[ 3.65068730e-07, -3.93410484e+01],
    #        [-3.93410484e+01,  9.77198860e+12]])
    

    The fit is fair:

    (figure: fitted model overlaid on the data)

    But as you noticed, the fitted parameters differ by several orders of magnitude, which can prevent the optimization from performing properly.
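
    As a quick check, the relative standard errors implied by the covariance matrix above quantify this. The sketch below simply reuses the parameters and covariance printed above (the exact numbers will depend on your run):

    # Values taken from the curve_fit output printed above
    parameters = np.array([3.23405915e-01, 1.59960168e+05])
    covariance = np.array([[ 3.65068730e-07, -3.93410484e+01],
                           [-3.93410484e+01,  9.77198860e+12]])

    # Standard errors are the square roots of the covariance diagonal
    std_errors = np.sqrt(np.diag(covariance))
    print(std_errors / np.abs(parameters))
    # roughly [0.002, 20]: ~0.2% uncertainty on A versus ~2000% on B,
    # consistent with your observation that the reported error on B dwarfs the value itself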

    Notice that your dataset is quite linear. The observed effect is not surprising and is inherent to the chosen model: the B parameter must be several orders of magnitude bigger than A to preserve the linear behaviour.

    This claim is supported by the first terms of the Taylor series: writing x(t) = -B + sqrt(A*B*t + B^2) = B*(sqrt(1 + A*t/B) - 1) and expanding for A*t/B << 1 gives x(t) ≈ A*t/2 - A^2*t^2/(8*B), so the quadratic correction vanishes as B grows:

    def taylor(t, A, B):
        return (A/2*t - A**2/B*t**2/8) 
    
    parameters, covariance = optimize.curve_fit(taylor, data.index, data["x"].values, p0=(1, 100), ftol=1e-8)
    parameters
    # array([3.23396685e-01, 1.05237134e+09])
    

    Unsurprisingly, the slope of your linear dataset can be captured, while the parameter B can be arbitrarily large and will cause floating-point arithmetic errors during the optimization (hence the precision-loss warning from minimize shown below).
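
    A small numerical illustration of that loss of precision, with values chosen purely for illustration: when B is huge, computing -B + np.sqrt(A*B*t + B**2) in float64 cancels almost all significant digits.

    A, t = 0.32, 70.0          # slope and time span of similar magnitude to the data
    for B in (1e3, 1e9, 1e18):
        print(B, -B + np.sqrt(A*B*t + B**2), A*t/2)
    # For moderate B the computed value tracks the model, but for very large B
    # the term A*B*t is lost next to B**2 and the result no longer resembles
    # the limit A*t/2 ≈ 11.2.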

    Analyzing Error Surface

    The problem can be reformulated as a minimization problem:

    def objective(beta, t, x):
        return np.sum(np.power(model(t, beta[0], beta[1]) - x, 2))
    
    result = optimize.minimize(objective, (1, 100), args=(data.index, data["x"].values))
    
    #      fun: 0.6594398116927569
    # hess_inv: array([[8.03062155e-06, 2.94644208e-04],
    #       [2.94644208e-04, 1.14979735e-02]])
    #      jac: array([2.07304955e-04, 6.40749931e-07])
    #  message: 'Desired error not necessarily achieved due to precision loss.'
    #     nfev: 389
    #      nit: 50
    #     njev: 126
    #   status: 2
    #  success: False
    #        x: array([3.24090627e-01, 2.11891188e+03])
    

    If we plot the MSE associated with your dataset, we get the following surface:

    (figure: MSE surface over (A, B) for the original dataset)

    We have a canyon that is narrow along the A axis but appears unbounded, at least over the first few decades, along the B axis. This supports the observations in your post and comments, and it gives a technical insight into why B cannot be fitted properly (a sketch of how such a surface can be generated follows below).
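
    For reference, a minimal sketch of how such a surface can be generated. It reuses the objective and data defined above and assumes matplotlib, which is not imported earlier:

    import matplotlib.pyplot as plt

    # Evaluate the objective on a grid of (A, B) values
    A_grid = np.linspace(0.1, 0.6, 100)
    B_grid = np.logspace(0, 6, 100)
    mse = np.array([[objective((a, b), data.index, data["x"].values)
                     for a in A_grid] for b in B_grid])

    # A log colour scale makes the narrow canyon along A visible
    plt.pcolormesh(A_grid, B_grid, np.log10(mse), shading="auto")
    plt.yscale("log")
    plt.xlabel("A")
    plt.ylabel("B")
    plt.colorbar(label="log10(MSE)")
    plt.show()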

    Performing the same operation on a synthetic dataset:

    t = np.linspace(0, 1000, 100)
    x = model(t, 0.35, 20)
    data = pd.DataFrame(x, index=t, columns=["x"])
    

    This dataset is built to exhibit the square-root shape in addition to the initial linear trend. Fitting it gives:

    result = optimize.minimize(objective, (1, 0), args=(data.index, data["x"].values), tol=1e-8)
    
    #      fun: 1.9284246829733202e-10
    # hess_inv: array([[ 4.34760333e-05, -4.42855253e-03],
    #       [-4.42855253e-03,  4.59219063e-01]])
    #      jac: array([ 4.35726463e-03, -2.19158602e-05])
    #  message: 'Desired error not necessarily achieved due to precision loss.'
    #     nfev: 402
    #      nit: 94
    #     njev: 130
    #   status: 2
    #  success: False
    #        x: array([ 0.34999987, 20.000013  ])
    

    This version of the problem has the following MSE surface:

    (figures: MSE surface over (A, B) for the synthetic dataset)

    It shows a convex valley around the known solution, which explains why both parameters can be fitted when the acquisition extends to sufficiently large times.

    Notice that the valley is strongly stretched, meaning that in this scenario your problem would benefit from normalization, e.g. by rescaling the parameters as sketched below.
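
    As an illustration of that last point, one simple way to normalize is to rescale the parameters so the optimizer works with numbers of comparable magnitude. This is only a sketch reusing the objective and synthetic data defined above; the scale factors are picked by hand:

    # Hand-picked scales so that both rescaled parameters are of order one
    scales = np.array([0.1, 10.0])

    def scaled_objective(beta_scaled, t, x):
        # Map the well-scaled search variables back to the physical (A, B)
        return objective(beta_scaled * scales, t, x)

    result = optimize.minimize(scaled_objective, (1.0, 1.0),
                               args=(data.index, data["x"].values), tol=1e-8)
    A_fit, B_fit = result.x * scales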