# "Full rank" error when estimating OLS with statsmodels

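I have historical data on crop yield, annual temperature and annual precipitation for a given region. My goal is to estimate the following linear model:

$$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3\,tmp_t + \beta_4\,tmp_t^2 + \beta_5\,p_t + \beta_6\,p_t^2 + \beta_7\,tmp_t\,p_t + \beta_8\,tmp_t^2\,p_t^2 + \varepsilon_t$$

where y is the annual crop yield, t stands for time (year), tmp for temperature (annual average) and p for precipitation (annual sum). The squared terms capture the influence of extreme values.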

My code is:

```
import pandas as pd
import statsmodels.formula.api as smf

# Load the yield, temperature and precipitation series.
df = pd.read_csv('https://raw.githubusercontent.com/kevinkuranyi/data/main/crop_yield.csv')
# OLS with HAC (Newey-West) standard errors; rows with missing values are dropped.
model = smf.ols(
    formula='y_banana ~ year + year2 + tmp + tmp2 + pre + pre2 + tmp_pre + tmp2_pre2',
    data=df, missing='drop',
).fit(cov_type='HAC', cov_kwds={'maxlags': 2})
model.summary()
```

When I run this, I get the following warning message:

```
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:1888: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 8, but rank is 5
warnings.warn('covariance of constraints does not have full '
```

I suspected it could be due to multicollinearity, but no matter which variable I omit, as long as I include more than 4 variables (even without the interaction terms or squared values, which could be linear combinations) I get this warning. I included several combinations as examples in this Colab notebook.

What could be the problem?

## Solution

You are using polynomials of badly scaled data.

Calendar year and calendar year squared are badly scaled. For a trend or similar, use e.g. `year - year0`. Judging by its very large standard error, `tmp` has a similar problem.
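To illustrate the scaling issue, here is a minimal sketch that compares the condition number of a quadratic trend design matrix built from raw calendar years with one built from `year - year0`. The year range is made up for illustration and is not taken from the question's data set, and `quadratic_design` is a hypothetical helper.

```python
import numpy as np

# Synthetic stand-in for the calendar-year column (assumed range, not the real data).
year = np.arange(1961, 2021, dtype=float)

def quadratic_design(x):
    """Design matrix with columns [1, x, x**2]."""
    return np.column_stack([np.ones_like(x), x, x**2])

raw = quadratic_design(year)                   # year and year**2 on the raw scale
shifted = quadratic_design(year - year.min())  # trend measured from year0

cond_raw = np.linalg.cond(raw)
cond_shifted = np.linalg.cond(shifted)
print(f"condition number, raw years:    {cond_raw:.3e}")
print(f"condition number, year - year0: {cond_shifted:.3e}")
```

A very large condition number means the columns are numerically close to linearly dependent, which is the kind of situation in which rank checks on the estimated covariance can fail.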

Plot the polynomial functions and check that the values are approximately in the same range. For best behavior the data should be rescaled to a small range, e.g. interval [0,1] or largest value below 10.
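One possible way to do this rescaling is min-max scaling of each base variable before the squared terms are built. The sketch below uses made-up values in place of the question's `year`, `tmp` and `pre` columns; `to_unit_interval` and `with_squares` are hypothetical helpers, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
# Made-up values standing in for the question's columns (assumptions, not real data).
year = np.arange(1961, 1961 + n, dtype=float)
tmp = rng.normal(25.0, 0.5, size=n)      # annual mean temperature
pre = rng.normal(1500.0, 200.0, size=n)  # annual precipitation sum

def to_unit_interval(x):
    """Linearly map x onto [0, 1] (min-max scaling)."""
    return (x - x.min()) / (x.max() - x.min())

def with_squares(columns):
    """Intercept plus each variable and its square."""
    terms = [np.ones(n)]
    for x in columns:
        terms.extend([x, x**2])
    return np.column_stack(terms)

X_raw = with_squares([year, tmp, pre])
X_scaled = with_squares([to_unit_interval(x) for x in (year, tmp, pre)])

# Compare the overall value range and the conditioning of both designs.
for label, X in (("raw", X_raw), ("scaled", X_scaled)):
    print(f"{label:>6}: values in [{X.min():.3g}, {X.max():.3g}], "
          f"condition number {np.linalg.cond(X):.3e}")
```

Because the rescaling is linear and each variable enters with an intercept, a linear and a squared term, both designs span the same column space. In exact arithmetic the fitted values are therefore unchanged; only the coefficients, their interpretation and the numerical stability differ.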

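NumPy's polynomial module can rescale the base variable automatically: `numpy.polynomial.Polynomial.fit`, for example, maps the data range to the window [-1, 1] by default. The plain `vander` functions do not rescale on their own.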
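As one way to get automatic rescaling from NumPy, `numpy.polynomial.Polynomial.fit` maps the range of the input onto its default window before fitting. The sketch below fits a quadratic trend to synthetic data (the series `y` is invented for illustration) and then converts the result back to raw calendar-year units.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
year = np.arange(1961, 2021, dtype=float)
# Invented quadratic trend plus noise, only to have something to fit.
t = year - 1990.0
y = 10.0 + 0.3 * t - 0.01 * t**2 + rng.normal(0.0, 0.5, size=year.size)

# fit() rescales `year` internally: the data range (domain) is mapped to the window.
p = Polynomial.fit(year, y, deg=2)
print("domain:", p.domain)  # range of the input data
print("window:", p.window)  # interval the data is mapped to before fitting
print("coefficients on the scaled variable:", p.coef)

# convert() re-expresses the same polynomial in unscaled calendar years.
p_raw = p.convert()
print("coefficients on raw years:", p_raw.coef)
```

The coefficients on the scaled variable are of moderate size, while the converted coefficients on raw years are large and nearly cancel each other, which is another view of the same scaling problem.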

A related blog post that I wrote a long time ago: https://jpktd.blogspot.com/2012/03/numerical-accuracy-in-linear-least.html
