I am trying to implement a gradient descent algorithm from scratch in Python, which should be fairly easy. However, I have been scratching my head for quite a while with my code now, unable to make it work.
I generate the data as follows:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
# Define the x array.
x = np.array(range(1, 100))
# Define the y array: a linear trend plus Gaussian noise.
y = 10 + 2 * x.ravel()
y = y + np.random.normal(loc=0, scale=70, size=99)
Then I define the parameters:
alpha = 0.01   # Learning rate
NbrIter = 100  # Number of iterations
m = len(y)
theta = np.random.randn(2, 1)
and my GD is as follows:
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * (X.T @ ((X @ theta) - y))
What I get is a huge matrix, meaning that I have some problem with the linear algebra, but I really fail to see where the issue is.
(Playing around with the matrices to try to get them to match, I reached a theta of the correct shape (2x1) with: theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta).T - y).T ). But it does look wrong, and the actual values are way off: array([[-8.92647663e+148], [-5.92079000e+150]]).)
I guess you were hit by broadcasting. The variable y has shape (99,). When y is subtracted from the result of X @ theta, which is a column vector of shape (99, 1), y is broadcast to a row vector of shape (1, 99), so the result of the subtraction is a (99, 99) matrix. To fix this, reshape y into a column vector with y.reshape(-1, 1).
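Here is a quick sketch of the shape problem. X is not shown in the question, so the design matrix below (a column of x values plus a column of ones) is an assumption:

import numpy as np

# Assumed setup; X is a hypothetical (99, 2) design matrix.
x = np.arange(1, 100)
X = np.column_stack((x, np.ones_like(x)))
y = 10 + 2 * x + np.random.normal(loc=0, scale=70, size=99)
theta = np.random.randn(2, 1)

print(((X @ theta) - y).shape)                 # (99, 99) -- broadcasting blow-up
print(((X @ theta) - y.reshape(-1, 1)).shape)  # (99, 1)  -- the intended column vector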
Now, a few optimizations:
X.T @ ((X @ theta) - y[:,None])
can be rewritten as:
(X.T @ X) @ theta - (X.T @ y[:, None])
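A quick check that the two forms agree, reusing the assumed X, y, and theta from the sketch above:

lhs = X.T @ ((X @ theta) - y[:, None])
rhs = (X.T @ X) @ theta - (X.T @ y[:, None])
print(np.allclose(lhs, rhs))  # True -- matrix multiplication distributes over subtraction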
The most costly computation can be taken out of the loop:
XtX = X.T @ X
Xty = X.T @ y[:, None]
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)
Now, inside the loop, you operate on a 2x2 matrix rather than a 99x2 one.
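A rough cost comparison of a single update step, again reusing the assumed X, y, and theta from the sketch above (exact timings will vary):

import timeit

y_col = y[:, None]
XtX = X.T @ X
Xty = X.T @ y_col

naive   = lambda: X.T @ ((X @ theta) - y_col)  # touches the (99, 2) matrix every step
precomp = lambda: XtX @ theta - Xty            # only touches the (2, 2) matrix

print(timeit.timeit(naive, number=10000))
print(timeit.timeit(precomp, number=10000))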
Let's take a look at convergence. Assuming that X is constructed like X = np.column_stack((x, np.ones_like(x))), it is possible to check the condition number of the matrix:
np.linalg.cond(XtX)
Which produced:
13475.851490419038
It means that the ratio between the maximal and minimal eigenvalues is about 13k. Therefore, using an alpha larger than 1/13k will likely result in bad convergence.
If you use alpha = 1e-5, the algorithm will converge. Good luck!
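For reference, here is a minimal end-to-end sketch under these assumptions (the assumed X from above, y reshaped to a column, and alpha = 1e-5); the seed and the np.linalg.lstsq comparison are additions for illustration only:

import numpy as np

np.random.seed(0)  # hypothetical seed, only to make the sketch reproducible

x = np.arange(1, 100)
X = np.column_stack((x, np.ones_like(x)))  # assumed design matrix
y = (10 + 2 * x + np.random.normal(loc=0, scale=70, size=99)).reshape(-1, 1)

m = len(y)
alpha = 1e-5      # small enough for the ~13k condition number
NbrIter = 100
theta = np.random.randn(2, 1)

XtX = X.T @ X
Xty = X.T @ y
for _ in range(NbrIter):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)

# The slope component should land reasonably close to 2 within these 100 iterations;
# the intercept direction is poorly conditioned and converges much more slowly.
print(theta.ravel())
print(np.linalg.lstsq(X, y, rcond=None)[0].ravel())  # closed-form solution for comparison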