Tags: python, statistics, regression, linear-regression, modeling

"Full rank" error when estimating OLS with statsmodels


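I have historical data for crop yield, annual temperature and annual precipitation for a given region. My goal is to estimate the following linear model:

$$y = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3\,tmp + \beta_4\,tmp^2 + \beta_5\,p + \beta_6\,p^2 + \beta_7\,(tmp \cdot p) + \beta_8\,(tmp^2 \cdot p^2) + \varepsilon$$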

Here y is the annual crop yield, t is time (year), tmp is temperature (annual average) and p is precipitation (annual sum). The squared terms capture the influence of extreme values.

My code is:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('https://raw.githubusercontent.com/kevinkuranyi/data/main/crop_yield.csv')

model = smf.ols(formula = 'y_banana ~ year+year2+tmp+tmp2+pre+pre2+tmp_pre+tmp2_pre2',
 data=df, missing='drop').fit(cov_type='HAC', cov_kwds={'maxlags': 2})
model.summary()

When I run this, I get the following warning:

/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:1888: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 8, but rank is 5
  warnings.warn('covariance of constraints does not have full '

I suspected it could be due to multicollinearity, but no matter which variable I omit, as long as I include more than 4 variables (even without the interaction terms or squared values, which could be linear combinations), I get this warning. I included several combinations as examples in this Colab notebook.

What could be the problem?


Solution

  • You are using polynomials of badly scaled data.

    Calendar year and calendar year squared are badly scaled. For a trend or similar term, use e.g. year - year0, where year0 is a reference year such as the first year in the sample. Judging by its very large standard error, tmp has a similar problem.

    Plot the polynomial functions and check that the values are approximately in the same range. For best behavior the data should be rescaled to a small range, e.g. the interval [0, 1], or so that the largest value is below 10.

    NumPy's polynomial module can rescale the base variable automatically: numpy.polynomial.Polynomial.fit, for example, maps the data to the window [-1, 1] before fitting.

    A related blog post that I wrote a long time ago: https://jpktd.blogspot.com/2012/03/numerical-accuracy-in-linear-least.html
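
The scaling advice above can be checked numerically. The following is a minimal sketch, not the answer's own code: it uses synthetic data that only mimics the magnitude of calendar years, temperatures and precipitation sums (an assumption, the real CSV is not loaded), and the helpers `minmax` and `quad_design` are hypothetical names introduced for illustration. It compares the condition number of a quadratic design matrix before and after rescaling each variable to [0, 1], and shows the domain that `numpy.polynomial.Polynomial.fit` maps to its window.

```python
import numpy as np

# Synthetic stand-in for the data (assumption: values only mimic the scale
# of calendar years, annual mean temperatures and annual precipitation sums).
rng = np.random.default_rng(0)
n = 60
year = np.arange(1961, 1961 + n, dtype=float)
tmp = rng.normal(25.0, 0.5, n)
pre = rng.normal(1500.0, 150.0, n)


def minmax(x):
    """Rescale an array linearly to the interval [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


def quad_design(*cols):
    """Design matrix with an intercept plus each column and its square."""
    parts = [np.ones(len(cols[0]))]
    for c in cols:
        parts += [c, c**2]
    return np.column_stack(parts)


X_raw = quad_design(year, tmp, pre)
X_scaled = quad_design(minmax(year), minmax(tmp), minmax(pre))

# Largest absolute value per column: the raw columns span many orders of
# magnitude, the rescaled ones all stay within [0, 1].
print("raw max per column:   ", np.abs(X_raw).max(axis=0))
print("scaled max per column:", np.abs(X_scaled).max(axis=0))

# The condition number indicates how close the columns are to being
# numerically linearly dependent (larger is worse).
cond_raw = np.linalg.cond(X_raw)
cond_scaled = np.linalg.cond(X_scaled)
print(f"condition number raw:    {cond_raw:.3g}")
print(f"condition number scaled: {cond_scaled:.3g}")

# Polynomial.fit maps the data domain [min(x), max(x)] to the window [-1, 1]
# before fitting, so the quadratic trend is estimated on a well-scaled variable.
y = 10 + 0.1 * (year - year[0]) + rng.normal(0.0, 1.0, n)
trend = np.polynomial.Polynomial.fit(year, y, deg=2)
print("domain:", trend.domain, "window:", trend.window)
```

The same idea carries over to the regression in the question: adding rescaled columns to the DataFrame (for example year minus the first year, or min-max scaled tmp and pre) and building the squared and interaction terms from those columns leaves the fitted model the same in exact arithmetic, but gives the estimator a much better conditioned design matrix.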