I want to fix (pre-set) the betas in multiple linear regression based on some data I have, which leads to an R-squared value less than 0% or greater than 100% based on the projection approach mentioned in the Hastie, Tibshirani, et al. book.
What's the best way to compute R-squared after fixing the beta values for a multiple linear regression with no intercept?
Load the data -
import numpy as np
import pandas as pd
import statsmodels.api as sm
data = sm.datasets.get_rdataset('iris').data
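For reference, the dataset has five columns, so the slicing below regresses Sepal.Length on the other three measurements -
data.columns
# Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'], dtype='object')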
Define x and y variables -
x = data.iloc[:, 1:4].values
y = data.iloc[:, 0].values
Solve for the betas via the normal equations, as per the Tibshirani book -
betas = np.linalg.solve(x.T @ x, x.T @ y)
# array([ 1.12106169, 0.92352887, -0.89567583])
sm.OLS(y, x).fit().params
# array([ 1.12106169, 0.92352887, -0.89567583])
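As a quick sanity check, the normal-equations solution and the statsmodels fit agree -
np.allclose(betas, sm.OLS(y, x).fit().params)
# True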
Alternatively, fix the betas based on some understanding of the environment -
alt_betas = np.array([3.7, -10, 45.78])  # fixed betas, as a NumPy array for the linear algebra below
Now, compute R-squared in three ways -
1. Using statsmodels with no intercept
2. Using the projection method
3. Using the projection method but with the fixated betas
sm.OLS(y, x).fit().rsquared * 100
# 99.61972754365206
(y @ x @ betas / (y @ y)) * 100
# 99.61972754365208
(y @ x @ alt_betas / (y @ y)) * 100
# 511.1237918393523
Now, I understand the result should be different given I'm using different betas, but this violates the rule that R-squared should lie between 0 and 1 (0% and 100%).
If I had some alternate betas, is there a way to fix them and still use statsmodels OLS to compute the R-squared?
Think of it this way: my use case requires the alternate betas, which I believe are the true representation of the environment from my perspective.
Thanks in advance!
Conceptually, RSquare ("R2") is the relative difference between the sum of squared residuals for a null model and the sum of squared residuals for a particular model of interest.
An ideal model yields predicted values that exactly equal the actual data values, giving a sum of squared residuals of zero, so R2 = 1. Less ideal models obtain R2 values less than 1.
Potentially a bad model of the data could obtain residuals even worse than those of the null model. For example, a model could generate "predicted" values by reversing the sign of each data value. If the model sum of squared residuals is larger than those of the null model, R2 will be negative.
Usually the null model is the mean value of the data being modelled (the predicted values from the null model are all the same, and equal to the mean of the data).
Alternatively, the null model could be zero (the predicted values from the null model are all the same, and equal to 0).
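For concreteness, the two null-model baselines are easy to compute directly from the question's y (a quick sketch):
ss_null_mean = (y - y.mean()) @ (y - y.mean())  # null model predicts the mean everywhere
ss_null_zero = y @ y                            # null model predicts zero everywhere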
Expressions for R2 in terms of linear algebra can be derived from the information at OLS in Matrix Form and How can I represent R squared in matrix form?.
Cutting to the chase, we can write a Python function to calculate R2, as a percent, using linear algebra. This function uses either one of the two null models described above.
def r2pct(x, y, betas, central=True):
    # Null model: the data's mean (central=True) or zero (central=False)
    y_baseline = y - y.mean() if central else y
    ss_null = y_baseline @ y_baseline  # sum of squared residuals of the null model
    # Sum of squared residuals of the model, ||y - x @ betas||^2, in expanded form
    ss_model = y @ y - 2 * (betas.T @ x.T @ y) + (betas.T @ x.T @ x @ betas)
    return (1 - ss_model / ss_null) * 100
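As a cross-check, the expanded quadratic form in ss_model is just ||y - x @ betas||^2 multiplied out, so an equivalent version (call it r2pct_resid; the name is mine, for illustration) can work from the raw residuals instead:
def r2pct_resid(x, y, betas, central=True):
    resid = y - x @ betas  # residuals of the model with the given betas
    y_baseline = y - y.mean() if central else y
    return (1 - (resid @ resid) / (y_baseline @ y_baseline)) * 100
Both versions return the same values (up to floating-point noise) in all of the calls below.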
Calling r2pct with the OLS betas and central=False matches the value in your original post: these beta values account for 99.6% of the data's variance from zero.
r2pct(x, y, betas, central=False)
# 99.61972754365202
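As an aside, central=False coincides with the projection formula in your post only because the OLS betas satisfy the normal equations (x.T @ x @ betas equals x.T @ y), which collapses ss_model to y @ y - y @ x @ betas. Arbitrary betas don't satisfy them, which is exactly why alt_betas pushed the projection formula past 100%. A quick check:
np.allclose(y @ x @ betas, betas @ x.T @ x @ betas)              # True for the OLS betas
np.allclose(y @ x @ alt_betas, alt_betas @ x.T @ x @ alt_betas)  # False for alt_betas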
These beta values account for 80.6% of the variance in the data (that is, variance from the mean).
r2pct(x, y, betas)
# 80.55673214701147
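To tie this back to your statsmodels question: for the fitted model, both flavors can be reproduced from the results object's sum-of-squares attributes, but as far as I know sm.OLS offers no way to pin the betas themselves, which is why r2pct takes them as an argument:
fit = sm.OLS(y, x).fit()
(1 - fit.ssr / fit.uncentered_tss) * 100  # matches central=False above
(1 - fit.ssr / fit.centered_tss) * 100    # matches central=True above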
The negative R2 value indicates that these alternative betas yield a model that fits the data very poorly, worse than just predicting that each data value would be 0.
r2pct(x, y, alt_betas, central=False)
# -2270.59299658298
These betas yield a model that fits the data very poorly, worse than just predicting that each data value would be equal to the data's mean.
r2pct(x, y, alt_betas)
# -121108.02817441802