Tags: python, machine-learning, scikit-learn, linear-regression

Python Sklearn Linear Regression Yields Incorrect Coefficient Values


I'm trying to find the slope and y-intercept coefficients for a linear equation. I created a test domain and range to verify that the numbers I was getting were correct. The equation should be y = 2x + 1, but the model reports a slope of 24 and a y-intercept of 40.3125. The model predicts every value I give it accurately, so I'm wondering how I can recover the proper coefficients.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(0, 40)
y = (2 * X) + 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
X_train = [[i] for i in X_train]
X_test = [[i] for i in X_test]

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

regr = linear_model.LinearRegression()

regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

print('Coefficients: \n', regr.coef_)
print('Y-intercept: \n', regr.intercept_)
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred))
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))

plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
print(X_test)

plt.xticks()
plt.yticks()

plt.show()

Solution

  • This is happening because you scaled your training and testing data. Even though you generated y as a linear function of X, you transformed X_train and X_test onto a different scale by standardizing them (subtracting the mean and dividing by the standard deviation). The fitted coefficients therefore describe the relationship between y and the *standardized* X, not the original X, which is why predictions are still correct while the slope and intercept look wrong.

    If we run your code but omit the lines where you scale the data, you get the expected results.

    X = np.arange(0, 40)
    y = (2 * X) + 1
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
    X_train = [[i] for i in X_train]
    X_test = [[i] for i in X_test]
    
    # Skip the scaling of X_train and X_test
    #sc = StandardScaler()
    #X_train = sc.fit_transform(X_train)
    #X_test = sc.transform(X_test)
    
    regr = linear_model.LinearRegression()
    regr.fit(X_train, y_train)
    
    y_pred = regr.predict(X_test)
    
    print('Coefficients: \n', regr.coef_)
    > Coefficients: 
    >  [2.]
    print('Y-intercept: \n', regr.intercept_)
    > Y-intercept: 
    >  1.0
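
  • If you do want to keep the scaling step (e.g. for a pipeline with regularized models), you can map the fitted coefficients back to the original scale by hand. A minimal sketch, using the `mean_` and `scale_` attributes that `StandardScaler` exposes after fitting: since the model sees x' = (x − mean) / scale, the original-scale slope is `coef_ / scale_` and the original-scale intercept is `intercept_ − coef_·mean_/scale_`.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Same toy data as the question: y = 2x + 1
    X = np.arange(0, 40).reshape(-1, 1)
    y = 2 * X.ravel() + 1

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    sc = StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)

    regr = LinearRegression().fit(X_train_scaled, y_train)

    # Undo the standardization x' = (x - mean_) / scale_:
    # y = coef_ * x' + intercept_
    #   = (coef_ / scale_) * x + (intercept_ - coef_ * mean_ / scale_)
    slope = regr.coef_ / sc.scale_
    intercept = regr.intercept_ - (regr.coef_ * sc.mean_ / sc.scale_).sum()

    print(slope)      # ≈ [2.]
    print(intercept)  # ≈ 1.0
    ```

    This also explains the numbers in the question: the reported slope of 24 is 2 × std(X_train), and the intercept of 40.3125 is 2 × mean(X_train) + 1.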