I am trying to implement a gradient descent algorithm from scratch in Python, which should be fairly easy. However, I have been scratching my head for quite a while with my code now, unable to make it work.
I generate the data as follows:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
# Define the x array.
x = np.array(range(1, 100))
# Define the y array: a linear trend plus Gaussian noise.
y = 10 + 2 * x.ravel()
y = y + np.random.normal(loc=0, scale=70, size=99)
Then I define the parameters:
alpha = 0.01   # Learning rate
NbrIter = 100  # Number of iterations
m = len(y)
theta = np.random.randn(2, 1)
and my GD is as follows:
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * (X.T @ ((X @ theta) - y))
What I get is a huge matrix, meaning that I have some problem with the linear algebra, but I really fail to see where the issue is.
(Playing around with the matrices to try to get them to match, I reached a theta of the correct shape (2x1) with: theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta).T - y).T ). But it does look wrong, and the actual values are way off: array([[-8.92647663e+148], [-5.92079000e+150]]).)
I guess you were hit by broadcasting. The variable y has shape (99,). When y is subtracted from the result of X @ theta, which is a column vector of shape (99, 1), y is broadcast to a row vector of shape (1, 99), so the result of the subtraction is a (99, 99) matrix. To fix this, reshape y into a column vector with y.reshape(-1, 1).
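Here is a quick sketch of the shape problem. X is not shown in the question, so the design matrix below (a column of x values plus a column of ones) is an assumption:

import numpy as np

# Assumed setup; X is a hypothetical (99, 2) design matrix.
x = np.arange(1, 100)
X = np.column_stack((x, np.ones_like(x)))
y = 10 + 2 * x + np.random.normal(loc=0, scale=70, size=99)
theta = np.random.randn(2, 1)

print(((X @ theta) - y).shape)                 # (99, 99) -- broadcasting blow-up
print(((X @ theta) - y.reshape(-1, 1)).shape)  # (99, 1)  -- the intended column vector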
Now, a few optimizations:
X.T @ ((X @ theta) - y[:,None])
can be rewritten as:
(X.T @ X) @ theta - (X.T @ y[:, None])
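A quick check that the two forms agree, reusing the assumed X, y, and theta from the sketch above:

lhs = X.T @ ((X @ theta) - y[:, None])
rhs = (X.T @ X) @ theta - (X.T @ y[:, None])
print(np.allclose(lhs, rhs))  # True -- matrix multiplication distributes over subtraction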
The most costly computation can be taken out of the loop:
XtX = X.T @ X
Xty = X.T @ y[:, None]
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)
Now, inside the loop, you operate on a 2x2 matrix rather than a 99x2 one.
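A rough cost comparison of a single update step, again reusing the assumed X, y, and theta from the sketch above (exact timings will vary):

import timeit

y_col = y[:, None]
XtX = X.T @ X
Xty = X.T @ y_col

naive   = lambda: X.T @ ((X @ theta) - y_col)  # touches the (99, 2) matrix every step
precomp = lambda: XtX @ theta - Xty            # only touches the (2, 2) matrix

print(timeit.timeit(naive, number=10000))
print(timeit.timeit(precomp, number=10000))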
Let's take a look at convergence. Assuming that X is constructed like X = np.column_stack((x, np.ones_like(x))), it is possible to check the condition number of the matrix:
np.linalg.cond(XtX)
Which produced:
13475.851490419038
It means that the ratio between the maximal and minimal eigenvalues is about 13k. Therefore, using an alpha larger than 1/13k will likely result in bad convergence.
If you use alpha = 1e-5, the algorithm will converge. Good luck!
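For reference, here is a minimal end-to-end sketch under these assumptions (the assumed X from above, y reshaped to a column, and alpha = 1e-5); the seed and the np.linalg.lstsq comparison are additions for illustration only:

import numpy as np

np.random.seed(0)  # hypothetical seed, only to make the sketch reproducible

x = np.arange(1, 100)
X = np.column_stack((x, np.ones_like(x)))  # assumed design matrix
y = (10 + 2 * x + np.random.normal(loc=0, scale=70, size=99)).reshape(-1, 1)

m = len(y)
alpha = 1e-5      # small enough for the ~13k condition number
NbrIter = 100
theta = np.random.randn(2, 1)

XtX = X.T @ X
Xty = X.T @ y
for _ in range(NbrIter):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)

# The slope component should land reasonably close to 2 within these 100 iterations;
# the intercept direction is poorly conditioned and converges much more slowly.
print(theta.ravel())
print(np.linalg.lstsq(X, y, rcond=None)[0].ravel())  # closed-form solution for comparison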