python numpy regression statsmodels non-linear-regression

StatsModels formula Polynomial Regression does not match numpy polyfit coefficients


My polynomial regression using the statsmodels formula API does not match the numpy polyfit coefficients.

Link to data https://drive.google.com/file/d/1fQuCoCF_TeXzZuUFyKaHCbD1zle2f1MF/view?usp=sharing

Below is my code

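import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv('sp500.csv')

# Convert each date to its proleptic Gregorian ordinal (0001-01-01 is day 1)
data['Date_Ordinal'] = pd.to_datetime(data['Date']).apply(lambda date: date.toordinal())

x = data['Date_Ordinal']
y = data['Value']

# Quadratic fit with numpy; coefficients are returned highest degree first
print(np.polyfit(x, y, 2))

# Same quadratic with the formula API; x and y are taken from the calling scope
model = smf.ols(formula='y ~ x + I(x**2)', data=data).fit()
print(model.summary())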

Numpy polyfit coefficient results:

array([ 4.17939013e-05, -6.09338454e+01, 2.22098809e+07])

Statsmodels coefficient results:

x**2: 7.468e-07

x: -0.5466

Intercept: -1.486e-06

When I add a quadratic trend line to the data in Excel, the Excel results coincide with the numpy coefficients. However, if I set the intercept of the Excel trend line to 1, the coefficients for x**2 and x equal the statsmodels coefficients, but the Excel intercept becomes 1 whereas the statsmodels intercept is -1.486e-06.

If I remove the intercept from the statsmodels formula by subtracting 1, the intercept simply disappears from the statsmodels results, but the other coefficients remain the same.

How can I get statsmodels to show the same coefficient results as numpy polyfit and Excel?


Solution

  • Polynomials can become very badly scaled if the underlying data is not in a small range around zero. As a consequence, the computations become numerically unstable and the results can be dominated by numerical noise.

    http://jpktd.blogspot.com/2012/03/numerical-accuracy-in-linear-least.html looks at a NIST test case with very badly scaled polynomials, for which many statistics packages cannot produce a numerically stable solution.

    Numpy's polynomial fitting can internally rescale the variables before creating the polynomial basis function.

    Generic regression models like OLS in statsmodels do not have the necessary information to rescale the underlying variables to improve numerical stability. Besides, decisions about scaling and handling multicollinearity are left to the user. The OLS summary should have printed a warning in this case.
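
One way to do the rescaling by hand is to center and scale x before fitting, then substitute back to recover the coefficients in powers of the raw x. The following is only a minimal sketch under stated assumptions: the data is synthetic (a stand-in for sp500.csv, which is not reproduced here), and the names z, m and s are illustrative. With z = (x - m) / s, the fit b0 + b1*z + b2*z**2 expands to a2 = b2/s**2, a1 = b1/s - 2*b2*m/s**2 and a0 = b0 - b1*m/s + b2*m**2/s**2.

```python
from datetime import date

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the question's data (assumption): 7000 daily
# date ordinals starting at 2000-01-01, so x is about 7.3e5 and x**2
# about 5e11.
rng = np.random.default_rng(0)
x = date(2000, 1, 1).toordinal() + np.arange(7000, dtype=float)
t = (x - x[0]) / 365.25                      # years since the first date
y = 1000 + 50 * t + 3 * t**2 + rng.normal(scale=20, size=x.size)

# Reference: numpy's fit on the raw ordinals, highest degree first.
a2_np, a1_np, a0_np = np.polyfit(x, y, 2)

# Center and scale x so that z has mean 0 and standard deviation 1.
m, s = x.mean(), x.std()
df = pd.DataFrame({"y": y, "z": (x - m) / s})

fit = smf.ols("y ~ z + I(z**2)", data=df).fit()
b0, b1, b2 = fit.params.to_numpy()           # order: Intercept, z, z**2

# Substitute z = (x - m) / s to express the fit in powers of the raw x.
a2 = b2 / s**2
a1 = b1 / s - 2 * b2 * m / s**2
a0 = b0 - b1 * m / s + b2 * m**2 / s**2

print("polyfit     :", a2_np, a1_np, a0_np)
print("statsmodels :", a2, a1, a0)
print("condition number of the rescaled design:", fit.condition_number)
```

The two printed coefficient rows should agree to several significant digits, and the condition number reported for the rescaled design should be small. The standard errors and p-values in fit.summary() refer to the coefficients of z, not of the raw x, so they need the same care when interpreted.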