To determine trends over time, I use scipy.optimize.curve_fit with X values coming from time.time(), for example 1663847528.7147126 (1.6 billion).
Doing a linear fit sometimes produces erroneous results, and providing approximate initial p0 values doesn't help. I found that the magnitude of X is a crucial factor in this error, and I wonder why.
Here is a simple snippet that shows a working and a non-working X offset:
import scipy.optimize
def fit_func(x, a, b):
    return a + b * x
y = list(range(5))
x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
# (array([-1.e+08, 1.e+00]), array([[ 0., -0.],
# [-0., 0.]]))
x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
# OptimizeWarning: Covariance of the parameters could not be estimated
# warnings.warn('Covariance of the parameters could not be estimated',
# (array([-4.53788811e+08, 4.53788812e-01]), array([[inf, inf],
# [inf, inf]]))
An almost perfect p0 for b removes the warning, but curve_fit still doesn't work:
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
# (array([-7.60846335e+10, 7.60846334e+01]), array([[-1.97051972e+19, 1.97051970e+10],
# [ 1.97051970e+10, -1.97051968e+01]]))
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09, 1.e+00]), array([[inf, inf],
# [inf, inf]]))
As a side question: is there perhaps a more efficient method for a linear fit? Sometimes I also want to find a second-order polynomial fit, though.
Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.
You are facing two problems:
- Fitting procedures are scale sensitive: variables of very different magnitudes can make the optimization ill-conditioned.
- Float arithmetic error: when switching from 1e8 to 1e9 you just hit the magnitude at which this kind of error becomes predominant.

The second one is very important to realize. Say you are limited to an 8-significant-digit representation; then 1 000 000 000 and 1 000 000 001 are the same number, as both collapse to the writing 1.0000000e9, and we cannot accurately represent 1.000000001e9, which requires more digits than are available. This is why your second example fails.
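A minimal demonstration of this effect, using float32 (roughly 7-8 significant decimal digits) as a stand-in for the limited-precision representation described above:

import numpy as np

# Around 1e9, adjacent representable float32 values are 64 apart,
# so 1 000 000 000 and 1 000 000 001 collapse to the very same number:
print(np.float32(1_000_000_000) == np.float32(1_000_000_001))  # True
print(np.spacing(np.float32(1e9)))                             # 64.0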
Additionally, you are using a Non-Linear Least Squares (NLLS) algorithm to solve a Linear Least Squares (LLS) problem, and this is also related to your problem.
You have three solutions: normalize your data before fitting; reduce the magnitude of your X values, e.g. by subtracting an offset (sketched just below); or increase the float precision. I'll choose the first one as it is the most generic; the second one has been proposed by @blunova and totally makes sense; the last is probably an inherent limitation.
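For completeness, a minimal sketch of the offset approach (see @blunova's answer for the actual proposal; subtracting x[0] here is just one possible choice):

import numpy as np
import scipy.optimize

def fit_func(x, a, b):
    return a + b * x

y = np.arange(5)
x = 1e9 + y

# Fit against relative values x - x[0]; the slope is unchanged and the
# intercept can be shifted back to the absolute scale afterwards.
popt, pcov = scipy.optimize.curve_fit(fit_func, x - x[0], y)
b = popt[1]              # slope, unaffected by the shift: 1.0
a = popt[0] - b * x[0]   # intercept in the original scale: -1e9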
To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:
import numpy as np
import scipy.optimize
y = np.arange(5)
x = 1e9 + y
def fit_func(x, a, b):
    return a + b * x
xm = np.mean(x) # 1000000002.0
xs = np.std(x) # 1.4142135623730951
result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)
# (array([2. , 1.41421356]),
# array([[0., 0.],
# [0., 0.]]))
# Back transformation (here a denotes the slope and b the intercept, i.e. y = a*x + b):
a = result[0][1]/xs                       # 1.0
b = result[0][0] - xm*result[0][1]/xs     # -1000000000.0
Or the same result using the sklearn interface:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])
pipe.fit(x.reshape(-1, 1), y)
pipe.named_steps["scaler"].mean_ # array([1.e+09])
pipe.named_steps["scaler"].scale_ # array([1.41421356])
pipe.named_steps["regressor"].coef_ # array([1.41421356])
pipe.named_steps["regressor"].intercept_ # 2.0
Indeed, when normalizing, the fit result is expressed in terms of the normalized variable. To get the required fit parameters, you just need a bit of math to convert the regressed parameters back into the original variable scale.
Simply write down and solve the transformation (primes denote the regressed parameters in the normalized space; a is the slope and b the intercept):

y = a'*x' + b'
x' = (x - m)/s
y = a*x + b

Which gives you the following solution:

a = a'/s
b = b' - m/s*a'
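As a quick sanity check, these formulas applied to the pipeline fitted above (reusing pipe from the sklearn snippet) recover the coefficients in the original scale:

m = pipe.named_steps["scaler"].mean_[0]             # 1000000002.0
s = pipe.named_steps["scaler"].scale_[0]            # 1.41421356...
a_prime = pipe.named_steps["regressor"].coef_[0]    # slope in normalized space
b_prime = pipe.named_steps["regressor"].intercept_  # intercept in normalized space

a = a_prime / s                # ≈ 1.0
b = b_prime - m / s * a_prime  # ≈ -1000000000.0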
NumPy's default float precision is float64, as you expected, which has about 15 significant digits:
x.dtype # dtype('float64')
np.finfo(np.float64).precision # 15
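So float64 itself has no trouble telling your x values apart; the spacing between adjacent representable numbers around 1e9 is still tiny:

print(np.spacing(1e9))      # 1.1920928955078125e-07
print(1e9 + 1.0 == 1e9)     # False: consecutive timestamps remain distinct in float64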
But scipy.optimize.curve_fit relies on scipy.optimize.least_squares, which uses a squared metric to drive the optimization.
Without digging into the details, I suspect this is where the problem happens: when dealing with values that are all close to 1e9, you reach the threshold where float arithmetic error becomes predominant.
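To give an idea of the orders of magnitude involved (an illustrative aside, not a full error analysis): once quantities on the order of 1e9, such as the b*x term in the model, get squared, you are working around 1e18, where the float64 grid becomes very coarse:

print(np.spacing(1e18))       # 128.0: adjacent float64 values around 1e18 differ by ~1e2
print(1e18 + 1.0 == 1e18)     # True: contributions of order 1 are simply lost at this scale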
So this 1e9 threshold you have hit is not related to the distinction between numbers in your variable x (float64 has more than enough precision to keep them all distinct), but to the use that is made of them when solving:

minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
You can also check in its signature that the default tolerances are only about 8 decades wide (1e-08):
scipy.optimize.least_squares(fun, x0, jac='2-point', bounds=(-inf, inf),
method='trf', ftol=1e-08, xtol=1e-08, gtol=1e-08, x_scale=1.0,
loss='linear', f_scale=1.0, diff_step=None, tr_solver=None,
tr_options={}, jac_sparsity=None, max_nfev=None, verbose=0,
args=(), kwargs={})
This may let you tweak the algorithm to take extra steps before convergence is reached (if it is reached at all), but it will not replace or beat the usefulness of normalization.
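For example, curve_fit forwards extra keyword arguments to the underlying solver, so tighter tolerances can be requested; whether that actually rescues a fit at this scale is another matter:

# Extra kwargs go to least_squares (method='trf'/'dogbox') or leastsq (method='lm')
scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0],
                         method='trf', ftol=1e-15, xtol=1e-15, gtol=1e-15)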
What is interesting with the scipy.stats.linregress method is that scale tolerance is handled by design. The method uses variable normalization, pure linear algebra, and a numerical-stability trick (see the TINY variable in its source) to solve the LS problem even in problematic conditions.
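For instance, on the problematic data from above it can be used directly, without any initial guess:

from scipy import stats

res = stats.linregress(x, y)
print(res.slope)       # ≈ 1.0
print(res.intercept)   # ≈ -1000000000.0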
This of course contrasts with the scipy.optimize.curve_fit method, which is an NLLS solver implemented as an iterative optimization (see the Levenberg–Marquardt algorithm).
If you stick with linear least squares problems (linear in terms of the parameters, not the variables, so a second-order polynomial is still LLS), then LLS is probably the simpler option to choose, as it handles normalization for you.
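For the second-order case mentioned in the question, numpy's polynomial module is one LLS option: Polynomial.fit rescales x into a stable internal domain, and convert() maps the coefficients back to the original scale (a quick sketch, not the only way to do this):

from numpy.polynomial import Polynomial

p = Polynomial.fit(x, y, deg=2)   # fitted on an internally rescaled domain
print(p.convert().coef)           # coefficients in the original x scale, ≈ [-1e9, 1, 0] here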