python, machine-learning, scikit-learn, linear-regression, statsmodels

OLS Regression Results R-squared value vs. r2_score from scikit-learn


I followed a tutorial online and used OLS (from statsmodels) to build the model. The OLS results gave me an amazing R^2 value (0.909). However, when I evaluated the model with scikit-learn's r2_score function, I only got 0.68.

Can someone tell me what the difference is here?

The dataset came from here: https://www.kaggle.com/harlfoxem/housesalesprediction

Here is my code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

import statsmodels.api as sm


df = pd.read_csv('kc_house_data.csv')

df = df.drop(['id', 'date'], axis=1)

# price is the first remaining column; everything else is a feature
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

X = sm.add_constant(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

reg_OLS = sm.OLS(endog=y_train, exog=X_train).fit()

print(reg_OLS.summary())

y_pred = reg_OLS.predict(X_test)

print(r2_score(y_test, y_pred))

Output from OLS Regression Results

OLS Regression Results
Dep. Variable:  price   R-squared (uncentered): 0.909
Model:  OLS Adj. R-squared (uncentered):    0.909
Method: Least Squares   F-statistic:    8491.
Date:   Thu, 17 Mar 2022    Prob (F-statistic): 0.00
Time:   23:36:48    Log-Likelihood: -1.9598e+05
No. Observations:   14408   AIC:    3.920e+05
Df Residuals:   14391   BIC:    3.921e+05
Df Model:   17      
Covariance Type:    nonrobust       
coef    std err t   P>|t|   [0.025  0.975]
bedrooms    -2.898e+04  2217.526    -13.068 0.000   -3.33e+04   -2.46e+04
bathrooms   3.621e+04   3845.397    9.416   0.000   2.87e+04    4.37e+04
sqft_living 100.2140    2.690   37.257  0.000   94.942  105.486
sqft_lot    0.2609  0.057   4.563   0.000   0.149   0.373
floors  1.201e+04   4187.373    2.867   0.004   3798.844    2.02e+04
waterfront  6.237e+05   2.01e+04    31.012  0.000   5.84e+05    6.63e+05
view    5.237e+04   2566.027    20.410  0.000   4.73e+04    5.74e+04
condition   2.844e+04   2774.191    10.250  0.000   2.3e+04 3.39e+04
grade   9.613e+04   2558.509    37.571  0.000   9.11e+04    1.01e+05
sqft_above  63.2770 2.638   23.985  0.000   58.106  68.448
sqft_basement   36.9370 3.127   11.813  0.000   30.808  43.066
yr_built    -2529.7423  80.708  -31.344 0.000   -2687.940   -2371.544
yr_renovated    13.0704 4.307   3.035   0.002   4.629   21.512
zipcode -510.0820   21.216  -24.043 0.000   -551.667    -468.497
lat 6.084e+05   1.27e+04    47.725  0.000   5.83e+05    6.33e+05
long    -2.076e+05  1.55e+04    -13.371 0.000   -2.38e+05   -1.77e+05
sqft_living15   33.6926 3.996   8.432   0.000   25.860  41.525
sqft_lot15  -0.4850 0.092   -5.275  0.000   -0.665  -0.305
Omnibus:    9620.580    Durbin-Watson:  1.997
Prob(Omnibus):  0.000   Jarque-Bera (JB):   363824.963
Skew:   2.694   Prob(JB):   0.00
Kurtosis:   27.021  Cond. No.   1.32e+17

Output from r2_score

0.6855578295481021

Solution

  • Your R^2 = 0.909 is the in-sample R^2 that statsmodels computes on the training data the model was fit to, while the r2_score of 0.68 is computed on the held-out test data, so a lower value is expected. Note also that your summary reports "R-squared (uncentered)", which divides by the raw sum of squares of y rather than the mean-centered one, so it is not directly comparable to scikit-learn's r2_score even on the same data.

    To see the first effect, predict on the training data and pass the training targets and those predictions to r2_score.
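    A minimal sketch of both points on synthetic data (illustrative only, not the Kaggle housing set): scoring on the training data reproduces the statsmodels R-squared, scoring on held-out data is typically lower, and dropping the constant switches statsmodels to the inflated uncentered definition.

    ```python
    import numpy as np
    import statsmodels.api as sm
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=500)

    Xc = sm.add_constant(X)  # prepend an intercept column
    X_train, X_test, y_train, y_test = train_test_split(Xc, y, test_size=1/3, random_state=0)

    model = sm.OLS(y_train, X_train).fit()

    # In-sample: r2_score on the training data matches the (centered) summary R-squared
    r2_train = r2_score(y_train, model.predict(X_train))
    # Out-of-sample: scored on held-out test data, typically lower
    r2_test = r2_score(y_test, model.predict(X_test))
    print(model.rsquared, r2_train, r2_test)

    # Without a constant column, statsmodels reports the *uncentered* R-squared,
    # which divides by sum(y**2) instead of sum((y - y.mean())**2); when y has a
    # nonzero mean this is larger and not comparable to sklearn's r2_score
    model_nc = sm.OLS(y_train, X_train[:, 1:]).fit()
    print(model_nc.rsquared)
    ```

    If the two definitions are the source of confusion in your case, check whether the constant actually survived into the design matrix statsmodels fit on (the summary says "(uncentered)" exactly when no intercept was detected).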