
Different Results using Simple Linear Regression Packages in Python: statsmodels.api vs sklearn


I am trying to understand why I get two different predictions from a linear regression model. I am using the same dataset and asking for a prediction at the same input value. The example code is pasted below; it is also available in an open Google Colab notebook.

import pandas as pd
from sklearn import linear_model, metrics
import statsmodels.api as sm

temp = [73,65,81,90,75,77,82,93,86,79]
gallons = [110,95,135,160,97,105,120,175,140,121]
merged = list(zip(temp, gallons))
df = pd.DataFrame(merged, columns = ['temp', 'gallons'])

X = df[['temp']]
Y = df['gallons']

regr = linear_model.LinearRegression().fit(X,Y)
print("Using sklearn package, 80 temp predicts rent of:", regr.predict([[80]]))

model = sm.OLS(Y,X).fit()
print("Using statsmodel.api package, 80 temp predicts rent of:", model.predict([80]))

With the above code, I receive a result of:
Using sklearn package, 80 temp predicts rent of: [125.5013734]
Using statsmodel.api package, 80 temp predicts rent of: [126.72501891]

Can someone explain why the result is not the same? My understanding is that they are both linear regression models.

Thank you!


Solution

  • statsmodels does not fit an intercept by default, while sklearn does. You have to add the intercept manually in statsmodels.

    From the statsmodels OLS documentation (Notes):

    No constant is added by the model unless you are using formulas.

    From the sklearn LinearRegression documentation:

    fit_intercept : bool, default=True. Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).

    Use the add_constant function to add an intercept column to X; both libraries will then give the same result.

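    # Prepend a column of ones so OLS also estimates an intercept
    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    # The row to predict must include the constant as well: [const, temp]
    print("Using statsmodel.api package, 80 temp predicts rent of:", model.predict([[1, 80]]))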
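
To see where the two numbers in the question come from, here is a minimal pure-Python sketch of the closed-form least-squares arithmetic for a single predictor. It is only an illustration of the idea, not how either library is implemented internally, and the helper names `fit_with_intercept` and `fit_through_origin` are made up for this example.

```python
# Illustrative sketch only: closed-form least squares for one predictor.
# The helper names below are hypothetical, not sklearn or statsmodels APIs.

temp = [73, 65, 81, 90, 75, 77, 82, 93, 86, 79]
gallons = [110, 95, 135, 160, 97, 105, 120, 175, 140, 121]


def fit_with_intercept(x, y):
    """Fit y = a + b*x, the model sklearn's LinearRegression fits by default."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def fit_through_origin(x, y):
    """Fit y = b*x with no intercept, the model sm.OLS(Y, X) fits without add_constant."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


a, b = fit_with_intercept(temp, gallons)
pred_with_intercept = a + b * 80

b0 = fit_through_origin(temp, gallons)
pred_through_origin = b0 * 80

print(round(pred_with_intercept, 4))   # 125.5014, the sklearn result
print(round(pred_through_origin, 4))   # 126.725, the statsmodels result without a constant
```

The two predictions differ only because the second model is forced through the origin. Once a constant column is added, statsmodels fits the same model as the first function.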