Tags: python, numpy, optimization, gradient-descent

Gradient descent from scratch in python not working


I am trying to implement a gradient descent algorithm from scratch in Python, which should be fairly easy. However, I have been scratching my head for quite a while with my code now, unable to make it work.

I generate the data as follows:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

#Defining the x array. 
x=np.array(range(1,100)) 

#Defining the y array. 
y=10+2*x.ravel() 
y=y+np.random.normal(loc=0, scale=70, size=99)

Then define the parameters:

alpha = 0.01  # Which will be the learning rate
NbrIter = 100  # Representing the number of iterations
m = len(y)
theta = np.random.randn(2,1)

and my GD is as follows:

for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta) - y) )

What I get is a huge matrix, meaning that I have some problem with the linear algebra. However, I really fail to see where the issue is.

(Playing around with the matrices to try to get them to match, I reached a theta with the correct shape (2x1) using: theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta).T - y).T ). But that does look wrong, and the actual values are way off (array([[-8.92647663e+148], [-5.92079000e+150]])).)


Solution

  • I guess you were hit by broadcasting. Variable y's shape is (99,). When y is subtracted from the result of X @ theta, which is a column vector of shape (99, 1), y is broadcast to a row vector of shape (1, 99), so the result of the subtraction is a (99, 99) matrix. To fix this, reshape y into a column vector with y.reshape(-1,1).
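
    To see the mismatch concretely, here is a minimal sketch (assuming X is the design matrix np.column_stack((x, np.ones_like(x))) used further down; the noise term does not matter here):

    import numpy as np

    x = np.arange(1, 100)
    X = np.column_stack((x, np.ones_like(x)))      # assumed design matrix, shape (99, 2)
    y = 10 + 2 * x                                 # shape (99,)
    theta = np.random.randn(2, 1)                  # shape (2, 1)

    print((X @ theta - y).shape)                   # (99, 99) -- broadcasting blew up the shape
    print((X @ theta - y.reshape(-1, 1)).shape)    # (99, 1)  -- column vector, as intended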

    Now, a few optimizations:

    X.T @ ((X @ theta) - y[:,None])
    

    can be rewritten as:

    (X.T@X) @ theta - (X.T @ y[:,None])
    

    The most costly computation can be taken out of the loop:

    XtX = X.T@X
    Xty = X.T @ y[:,None]
    
    for iter in range(NbrIter):
        theta = theta - (1/m) * alpha * (XtX @ theta - Xty)
    

    Now you operate on a 2x2 matrix rather than a 99x2 one.
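
    As a quick sanity check that the two forms of the gradient agree (a sketch under the same assumption about X):

    import numpy as np

    x = np.arange(1, 100)
    X = np.column_stack((x, np.ones_like(x)))   # assumed design matrix
    y = (10 + 2 * x).reshape(-1, 1)             # y as a column vector
    theta = np.random.randn(2, 1)

    XtX = X.T @ X                               # shape (2, 2)
    Xty = X.T @ y                               # shape (2, 1)

    print(np.allclose(X.T @ (X @ theta - y), XtX @ theta - Xty))   # True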

    Let's take a look at convergence. Assuming that X is constructed like X=np.column_stack((x, np.ones_like(x))), it is possible to check the condition number of the matrix:

    np.linalg.cond(XtX)

    Which produced: 13475.851490419038

    It means that the ratio between the maximal and minimal eigenvalue of XtX is about 13k. Therefore using an alpha larger than 1/13k will likely result in bad convergence.
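
    You can check those eigenvalues directly (reusing XtX from above; approximate values for x = 1..99):

    eig = np.linalg.eigvalsh(XtX)    # XtX is symmetric, eigenvalues in ascending order
    print(eig)                       # roughly [2.4e+01, 3.3e+05]
    print(eig[-1] / eig[0])          # ~1.35e+04, matching np.linalg.cond(XtX)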

    If you use alpha=1e-5 the algorithm will converge. Good luck!
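
    Putting it all together (a runnable sketch under the same assumptions; the exact numbers depend on the random noise):

    import numpy as np

    x = np.arange(1, 100)
    y = 10 + 2 * x + np.random.normal(loc=0, scale=70, size=99)
    y = y.reshape(-1, 1)                        # column vector, avoids the broadcasting issue
    X = np.column_stack((x, np.ones_like(x)))   # assumed design matrix: [x, 1]
    m = len(y)

    XtX = X.T @ X
    Xty = X.T @ y

    for alpha in (0.01, 1e-5):
        theta = np.zeros((2, 1))
        for _ in range(100):
            theta = theta - (1 / m) * alpha * (XtX @ theta - Xty)
        print(alpha, theta.ravel())

    # alpha=0.01 blows up to huge values (as in the question), while alpha=1e-5
    # stays stable; with more iterations it approaches the least-squares
    # solution np.linalg.solve(XtX, Xty).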