Tags: python, numpy, machine-learning, scikit-learn, gradient-descent

NumPy-based gradient descent not fully converging


I believe I have implemented gradient descent (GD) correctly (based in part on Aurélien Géron's book), but it is not returning the same result as sklearn's LinearRegression. Here is the full notebook: https://colab.research.google.com/drive/17lvCb_F_vMskT1PxbrKCSR57B5lMWT7A?usp=sharing

I'm not doing anything fancy; here is the code that loads the training data:

import numpy as np
import pandas as pd
import sklearn.datasets

#load data
data_arr = sklearn.datasets.load_diabetes(as_frame=True).data.values

X_raw = data_arr[:,1:] 
y_raw = data_arr[:, 1:2]

#add bias
X = np.hstack((np.ones(y_raw.shape),X_raw))
y = y_raw

#do gradient descent
learning_rate = 0.001
iterations = 1_000_000

observations = X.shape[0]
features = X.shape[1]

w = np.ones((features,1))

for i in range(iterations):
    w -= (learning_rate) * (2/observations) * X.T.dot(X.dot(w) - y)

Here are the weights this produced:

array([[ 2.72774600e-17],
       [ 1.01847403e+00],
       [ 3.87858604e-02],
       [ 3.06547577e-04],
       [-3.67525543e-01],
       [ 9.09006216e-02],
       [ 4.21512716e-01],
       [ 4.25673672e-01],
       [ 4.77147289e-02],
       [-8.14471370e-03]])

And the MSE: 5.24937033143115e-05

Here is what sklearn gives me:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

%time reg = LinearRegression().fit(X, y)
reg.coef_

sklearn weights:

array([[ 0.00000000e+00,  1.00000000e+00, -9.99200722e-16,
        -1.69309011e-15, -1.11022302e-16,  1.38777878e-15,
        -3.88578059e-16,  6.80011603e-16, -8.32667268e-17,
        -5.55111512e-16]])

sklearn MSE: 1.697650600978984e-32

I've tried increasing and decreasing the number of iterations and the learning rate. Scikit-learn returns its result in a few milliseconds, while my GD implementation can run for minutes and still not get anywhere close to sklearn's results.

Am I doing something obviously wrong here?

(The notebook contains a cleaner version of this code.)


Solution

  • There is a small bug in your code: the first column of X_raw is the same column as y_raw, i.e. the target is also being used as a feature (the quick check below confirms this). This has been corrected in the code below.

    Another issue is that if you include a column of ones in the feature matrix X, you should set fit_intercept=False when fitting the linear regression with sklearn; otherwise sklearn fits its own intercept on top of your explicit bias column, which is equivalent to having two columns of ones in the design matrix.

    It is also not clear why you are dividing by the number of observations in the gradient update. That division corresponds to the gradient of the mean (rather than summed) squared error, and its effect is to shrink the effective step size by a factor of n, which makes convergence very slow at this learning_rate; the code below drops it.
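
    A quick way to confirm the first point, using the arrays exactly as defined in the question (this snippet is purely illustrative and not part of the fix):

    # column 0 of the question's X_raw is identical to y_raw, which is why
    # sklearn simply puts a weight of 1 on it and (near) 0 everywhere else
    print(np.allclose(X_raw[:, 0], y_raw.ravel()))  # True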

    import numpy as np
    import pandas as pd
    import sklearn.datasets
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    
    # load data
    data_arr = sklearn.datasets.load_diabetes(as_frame=True).data.values
    
    # extract features and target
    X_raw = data_arr[:, 1:]
    y_raw = data_arr[:, :1]
    
    # add bias
    X = np.hstack((np.ones(y_raw.shape), X_raw))
    y = y_raw
    
    # do gradient descent
    learning_rate = 0.001
    iterations = 1000000
    
    observations = X.shape[0]
    features = X.shape[1]
    
    w = np.ones((features, 1))
    
    # gradient step; compared with the question, the division by the number
    # of observations has been dropped (a pure rescaling of the step size)
    for i in range(iterations):
        w -= 2 * learning_rate * X.T.dot(X.dot(w) - y)
    
    # exclude the intercept as X already contains a column of ones
    reg = LinearRegression(fit_intercept=False).fit(X, y)
    
    # compare the estimated coefficients
    res = pd.DataFrame({
        'manual': [format(x, '.6f') for x in w.flatten()],
        'sklearn': [format(x, '.6f') for x in reg.coef_.flatten()]
    })
    
    res
    #       manual    sklearn
    # 0  -0.000000  -0.000000
    # 1   0.101424   0.101424
    # 2  -0.006468  -0.006468
    # 3   0.208211   0.208211
    # 4  -0.128653  -0.128653
    # 5   0.236556   0.236556
    # 6   0.132544   0.132544
    # 7  -0.039359  -0.039359
    # 8   0.177129   0.177129
    # 9   0.145396   0.145396
    
    # compare the RMSE
    print(format(mean_squared_error(y, X.dot(w), squared=False), '.6f'))
    # 0.043111
    
    print(format(mean_squared_error(y, reg.predict(X), squared=False), '.6f'))
    # 0.043111
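
    As a final sanity check, note that LinearRegression does not iterate: it solves the least-squares problem directly, which is why it finishes in milliseconds. A minimal cross-check with numpy's direct solver, reusing X, y and the fitted w from above:

    # closed-form ordinary least squares via numpy's direct solver
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    # expect True: the coefficients printed above agree to 6 decimal places
    print(np.allclose(w_lstsq, w, atol=1e-5))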