I want to fix (pre-set) the betas in multiple linear regression based on some data I have, which leads to an R-squared value less than 0% or greater than 100% based on the projection approach mentioned in the Hastie, Tibshirani, et al. book.
What's the best way to compute R-squared after fixing the beta values for a multiple linear regression with no intercept?
Load the data -
import numpy as np
import pandas as pd
import statsmodels.api as sm
data = sm.datasets.get_rdataset('iris').data
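For reference, the dataset has five columns, so the slicing below regresses Sepal.Length on the other three measurements -
data.columns
# Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'], dtype='object')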
Define x and y variables -
x = data.iloc[:, 1:4].values
y = data.iloc[:, 0].values
Solve for the betas via the normal equations, as per the Tibshirani book -
betas = np.linalg.solve(x.T @ x, x.T @ y)
# array([ 1.12106169, 0.92352887, -0.89567583])
sm.OLS(y, x).fit().params
# array([ 1.12106169, 0.92352887, -0.89567583])
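As a quick sanity check, the normal-equations solution and the statsmodels fit agree -
np.allclose(betas, sm.OLS(y, x).fit().params)
# True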
Alternatively, fix the betas based on some understanding of the environment -
alt_betas = np.array([3.7, -10, 45.78])  # fixed betas, as a NumPy array for the linear algebra below
Now, compute R-squared in three ways -
1. Using statsmodels with no intercept
2. Using the projection method
3. Using the projection method but with the fixated betas
sm.OLS(y, x).fit().rsquared * 100
# 99.61972754365206
(y @ x @ betas / (y @ y)) * 100
# 99.61972754365208
(y @ x @ alt_betas / (y @ y)) * 100
# 511.1237918393523
Now, I understand the result should be different given I'm using different betas, but this violates the rule that R-squared should lie between 0 and 1 (0% and 100%).
If I had some alternate betas, is there a way to fix them and still use statsmodels OLS to compute the R-squared?
Think of it this way: my use case requires the alternate betas, which I believe are the true representation of the environment from my perspective.
Thanks in advance!
Conceptually, RSquare ("R2") is the relative difference between the sum of squared residuals for a null model and the sum of squared residuals for a particular model of interest.
An ideal model yields predicted values that exactly equal the actual data values, giving a sum of squared residuals of zero, so R2 = 1. Less ideal models obtain R2 values less than 1.
Potentially a bad model of the data could obtain residuals even worse than those of the null model. For example, a model could generate "predicted" values by reversing the sign of each data value. If the model sum of squared residuals is larger than those of the null model, R2 will be negative.
Usually the null model is the mean value of the data being modelled (the predicted values from the null model are all the same, and equal to the mean of the data).
Alternatively, the null model could be zero (the predicted values from the null model are all the same, and equal to 0).
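For concreteness, the two null-model baselines are easy to compute directly from the question's y (a quick sketch):
ss_null_mean = (y - y.mean()) @ (y - y.mean())  # null model predicts the mean everywhere
ss_null_zero = y @ y                            # null model predicts zero everywhere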
Expressions for R2 in terms of linear algebra can be derived from the information at OLS in Matrix Form and How can I represent R squared in matrix form?.
Cutting to the chase, we can write a Python function to calculate R2, as a percent, using linear algebra. This function uses either one of the two null models described above.
def r2pct(x, y, betas, central=True):
    # Null model: the data's mean (central=True) or zero (central=False)
    y_baseline = y - y.mean() if central else y
    ss_null = y_baseline @ y_baseline  # sum of squared residuals of the null model
    # Sum of squared residuals of the model, ||y - x @ betas||^2, in expanded form
    ss_model = y @ y - 2 * (betas.T @ x.T @ y) + (betas.T @ x.T @ x @ betas)
    return (1 - ss_model / ss_null) * 100
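As a cross-check, the expanded quadratic form in ss_model is just ||y - x @ betas||^2 multiplied out, so an equivalent version (call it r2pct_resid; the name is mine, for illustration) can work from the raw residuals instead:
def r2pct_resid(x, y, betas, central=True):
    resid = y - x @ betas  # residuals of the model with the given betas
    y_baseline = y - y.mean() if central else y
    return (1 - (resid @ resid) / (y_baseline @ y_baseline)) * 100
Both versions return the same values (up to floating-point noise) in all of the calls below.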
Calling r2pct with the OLS betas and central=False matches the value in your original post: these beta values account for 99.6% of the data's variance from zero.
r2pct(x, y, betas, central=False)
# 99.61972754365202
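As an aside, central=False coincides with the projection formula in your post only because the OLS betas satisfy the normal equations (x.T @ x @ betas equals x.T @ y), which collapses ss_model to y @ y - y @ x @ betas. Arbitrary betas don't satisfy them, which is exactly why alt_betas pushed the projection formula past 100%. A quick check:
np.allclose(y @ x @ betas, betas @ x.T @ x @ betas)              # True for the OLS betas
np.allclose(y @ x @ alt_betas, alt_betas @ x.T @ x @ alt_betas)  # False for alt_betas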
These beta values account for 80.6% of the variance in the data (that is, variance from the mean).
r2pct(x, y, betas)
# 80.55673214701147
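To tie this back to your statsmodels question: for the fitted model, both flavors can be reproduced from the results object's sum-of-squares attributes, but as far as I know sm.OLS offers no way to pin the betas themselves, which is why r2pct takes them as an argument:
fit = sm.OLS(y, x).fit()
(1 - fit.ssr / fit.uncentered_tss) * 100  # matches central=False above
(1 - fit.ssr / fit.centered_tss) * 100    # matches central=True above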
The negative R2 value indicates that these alternative betas yield a model that fits the data very poorly, worse than just predicting that each data value would be 0.
r2pct(x, y, alt_betas, central=False)
# -2270.59299658298
These betas yield a model that fits the data very poorly, worse than just predicting that each data value would be equal to the data's mean.
r2pct(x, y, alt_betas)
# -121108.02817441802