Tags: python, numpy, machine-learning, scikit-learn, gradient-descent

NumPy-based gradient descent not fully converging


I believe I have implemented gradient descent (GD) correctly (based in part on Aurélien Géron's book), but it is not returning the same result as sklearn's LinearRegression. Here is the full notebook: https://colab.research.google.com/drive/17lvCb_F_vMskT1PxbrKCSR57B5lMWT7A?usp=sharing

I'm not doing anything fancy; here is the code that loads the training data:

import numpy as np
import pandas as pd
import sklearn.datasets

#load data
data_arr = sklearn.datasets.load_diabetes(as_frame=True).data.values

X_raw = data_arr[:,1:] 
y_raw = data_arr[:, 1:2]

#add bias
X = np.hstack((np.ones(y_raw.shape),X_raw))
y = y_raw

#do gradient descent
learning_rate = 0.001
iterations = 1_000_000

observations = X.shape[0]
features = X.shape[1]

w = np.ones((features,1))

for i in range(iterations):
    w -= (learning_rate) * (2/observations) * X.T.dot(X.dot(w) - y)

Here are the weights this produced:

array([[ 2.72774600e-17],
       [ 1.01847403e+00],
       [ 3.87858604e-02],
       [ 3.06547577e-04],
       [-3.67525543e-01],
       [ 9.09006216e-02],
       [ 4.21512716e-01],
       [ 4.25673672e-01],
       [ 4.77147289e-02],
       [-8.14471370e-03]])

And the MSE: 5.24937033143115e-05

Here is what sklearn gives me:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

%time reg = LinearRegression().fit(X, y)
reg.coef_

sklearn weights:

array([[ 0.00000000e+00,  1.00000000e+00, -9.99200722e-16,
        -1.69309011e-15, -1.11022302e-16,  1.38777878e-15,
        -3.88578059e-16,  6.80011603e-16, -8.32667268e-17,
        -5.55111512e-16]])

sklearn MSE: 1.697650600978984e-32

I've tried increasing and decreasing the number of iterations and the learning rate. Scikit-learn returns its result in a few milliseconds, while my GD implementation can run for minutes and still not get anywhere close to sklearn's results.

Am I doing something obviously wrong here?

(The notebook contains a cleaner version of this code.)


Solution

  • There is a small bug in your code: the first column of X_raw is the same column as y_raw, i.e. the target is also being used as a feature (the quick check below confirms this). This has been corrected in the code below.

    Another issue is that if you include a column of ones in the feature matrix X, you should set fit_intercept=False when fitting the linear regression with sklearn; otherwise sklearn fits its own intercept on top of your explicit bias column, which is equivalent to having two columns of ones in the design matrix.

    It is also not clear why you are dividing by the number of observations in the gradient update. That division corresponds to the gradient of the mean (rather than summed) squared error, and its effect is to shrink the effective step size by a factor of n, which makes convergence very slow at this learning_rate; the code below drops it.
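
    A quick way to confirm the first point, using the arrays exactly as defined in the question (this snippet is purely illustrative and not part of the fix):

    # column 0 of the question's X_raw is identical to y_raw, which is why
    # sklearn simply puts a weight of 1 on it and (near) 0 everywhere else
    print(np.allclose(X_raw[:, 0], y_raw.ravel()))  # True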

    import numpy as np
    import pandas as pd
    import sklearn.datasets
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    
    # load data
    data_arr = sklearn.datasets.load_diabetes(as_frame=True).data.values
    
    # extract features and target
    X_raw = data_arr[:, 1:]
    y_raw = data_arr[:, :1]
    
    # add bias
    X = np.hstack((np.ones(y_raw.shape), X_raw))
    y = y_raw
    
    # do gradient descent
    learning_rate = 0.001
    iterations = 1000000
    
    observations = X.shape[0]
    features = X.shape[1]
    
    w = np.ones((features, 1))
    
    # gradient step; compared with the question, the division by the number
    # of observations has been dropped (a pure rescaling of the step size)
    for i in range(iterations):
        w -= 2 * learning_rate * X.T.dot(X.dot(w) - y)
    
    # exclude the intercept as X already contains a column of ones
    reg = LinearRegression(fit_intercept=False).fit(X, y)
    
    # compare the estimated coefficients
    res = pd.DataFrame({
        'manual': [format(x, '.6f') for x in w.flatten()],
        'sklearn': [format(x, '.6f') for x in reg.coef_.flatten()]
    })
    
    res
    #       manual    sklearn
    # 0  -0.000000  -0.000000
    # 1   0.101424   0.101424
    # 2  -0.006468  -0.006468
    # 3   0.208211   0.208211
    # 4  -0.128653  -0.128653
    # 5   0.236556   0.236556
    # 6   0.132544   0.132544
    # 7  -0.039359  -0.039359
    # 8   0.177129   0.177129
    # 9   0.145396   0.145396
    
    # compare the RMSE
    print(format(mean_squared_error(y, X.dot(w), squared=False), '.6f'))
    # 0.043111
    
    print(format(mean_squared_error(y, reg.predict(X), squared=False), '.6f'))
    # 0.043111
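
    As a final sanity check, note that LinearRegression does not iterate: it solves the least-squares problem directly, which is why it finishes in milliseconds. A minimal cross-check with numpy's direct solver, reusing X, y and the fitted w from above:

    # closed-form ordinary least squares via numpy's direct solver
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    # expect True: the coefficients printed above agree to 6 decimal places
    print(np.allclose(w_lstsq, w, atol=1e-5))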