Tags: python, numpy, scipy, linear-regression, sklearn-pandas

Multiple Linear Regression. Coeffs don't match


So I have this small dataset and I want to perform multiple linear regression on it.

First I dropped the deliveries column because of its high correlation with miles. Although gasprice should arguably be removed as well, I kept it so that I could perform multiple linear regression rather than simple linear regression. Finally, I removed the outliers and did the following:

Dataset
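The column drop and the outlier removal were roughly along these lines (just a sketch — df is the raw dataframe loaded from the dataset above, and the 3-sigma z-score rule is only one common way to do the outlier step):

import numpy as np
from scipy import stats

# drop deliveries because of its high correlation with miles
dfdropped = df.drop(columns=['deliveries'])

# remove outliers: keep only rows within 3 standard deviations in every column
dfafter = dfdropped[(np.abs(stats.zscore(dfdropped)) < 3).all(axis=1)]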

import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model

%matplotlib inline

X = dfafter
Y = dfafter[['hours']]

# split X and Y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

# create a Linear Regression model object
regression_model = LinearRegression()

# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train) 


# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]

print("The intercept for our model is {}".format(intercept))
print('-'*100)

# loop through the columns and print each coefficient
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later 

# rebuild the model using statsmodels for easier analysis
X2 = sm.add_constant(X)

# create an OLS model
model = sm.OLS(Y, X2)

# fit the data
est = model.fit()



# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)

# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)

# calculate the root mean squared error
model_rmse =  math.sqrt(model_mse)

# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))


print(est.summary())
#????????? something is wrong



X = df[['miles', 'gasprice']]
y = df['hours']

regr = linear_model.LinearRegression()
regr.fit(X, y)

print(regr.coef_)

So the code ends here. Each of the three places where I print the coefficients gives me different values. What did I do wrong, and is any of them correct?


Solution

  • I see you are trying 3 different things here, so let me summarize:

    1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
    2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
    3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.

    I tried to reproduce this with the iris dataset, and I get identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
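    A minimal sketch of that reproduction (the iris columns chosen as features and target are arbitrary, and load_iris(as_frame=True) needs a reasonably recent scikit-learn):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    import statsmodels.api as sm

    iris = load_iris(as_frame=True)
    X = iris.data[['sepal length (cm)', 'sepal width (cm)']]  # stand-in features
    y = iris.data['petal length (cm)']                        # stand-in target

    # 1. sklearn on the 80% training split only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    print(LinearRegression().fit(X_train, y_train).coef_)

    # 2. statsmodels OLS on the full dataset (intercept appears separately as 'const')
    print(sm.OLS(y, sm.add_constant(X)).fit().params)

    # 3. sklearn on the full dataset -- same coefficients as #2
    print(LinearRegression().fit(X, y).coef_)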

    In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html