I am taking this Coursera class on machine learning / linear regression. Here is how they describe the gradient descent algorithm for solving for the estimated OLS coefficients:

$$w^{(t+1)} = w^{(t)} - \eta \, \nabla RSS\left(w^{(t)}\right), \quad \text{where} \quad \nabla RSS(w) = -2H^T(y - Hw)$$

So they use $w$ for the coefficients, $H$ for the design matrix (or features, as they call it), and $y$ for the dependent variable. And their convergence criterion is the usual one of the norm of the gradient of the RSS being less than a tolerance $\epsilon$; that is, their definition of "not converged" is:

$$\left\| \nabla RSS\left(w^{(t)}\right) \right\| > \epsilon$$
I am having trouble getting this algorithm to converge and was wondering if I was overlooking something in my implementation. Below is the code. Please note that I also ran the sample dataset I use in it (df) through the statsmodels regression library, just to see that a regression on it could converge and to get coefficient values to tie out with. It did, and they were:
Intercept 4.344435
x1 4.387702
x2 0.450958
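For reference, a minimal sketch of that statsmodels check using the formula API (the exact call is incidental; any OLS fit of y on x1 and x2 gives the same coefficients, with df as defined further down):

import statsmodels.formula.api as smf

# Fit y ~ x1 + x2 by ordinary least squares and print the coefficients
fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)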
Here is my implementation. At each iteration, it prints the norm of the gradient of RSS:
import numpy as np
import numpy.linalg as LA
import pandas as pd
from pandas import DataFrame

# First define the grad function: grad(RSS) = -2H'(y-Hw)
def grad_rss(df, var_name_y, var_names_h, w):
    # Set up feature matrix H
    H = DataFrame({"Intercept": [1 for i in range(0, len(df))]})
    for var_name_h in var_names_h:
        H[var_name_h] = df[var_name_h]

    # Set up y vector
    y = df[var_name_y]

    # Calculate the gradient of the RSS: -2H'(y - Hw)
    result = -2 * np.transpose(H.values) @ (y.values - H.values @ w)
    return result

def ols_gradient_descent(df, var_name_y, var_names_h, epsilon=0.0001, eta=0.05):
    # Set all initial w values to 0.0001 (not related to our choice of epsilon)
    w = np.array([0.0001 for i in range(0, len(var_names_h) + 1)])

    # Iteration counter
    t = 0

    # Basic algorithm: keep subtracting eta * grad(RSS) from w until
    # ||grad(RSS)|| < epsilon.
    while True:
        t = t + 1
        grad = grad_rss(df, var_name_y, var_names_h, w)
        norm_grad = LA.norm(grad)
        if norm_grad < epsilon:
            break
        else:
            print("{} : {}".format(t, norm_grad))
            w = w - eta * grad
        if t > 10:
            raise Exception("Failed to converge")
    return w
# ##########################################
df = DataFrame({
    "y"  : [20, 40, 60, 80, 100],
    "x1" : [1, 5, 7, 9, 11],
    "x2" : [23, 29, 60, 85, 99]
})
# Run
ols_gradient_descent(df, "y", ["x1", "x2"])
Unfortunately this does not converge, and in fact prints a norm that is exploding with each iteration:
1 : 44114.31506051333
2 : 98203544.03067812
3 : 218612547944.95386
4 : 486657040646682.9
5 : 1.083355358314664e+18
6 : 2.411675439503567e+21
7 : 5.368670935963926e+24
8 : 1.1951287949674022e+28
9 : 2.660496151835357e+31
10 : 5.922574875391406e+34
11 : 1.3184342751414824e+38
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
......
Exception: Failed to converge
If I increase the maximum number of iterations, it still doesn't converge; the norm just blows up toward infinity.
Is there an implementation error here, or am I misinterpreting the explanation in the class notes?
As @Kant suggested, the eta needs to be updated at each iteration. The course itself had some sample formulas for this, but none of them helped with convergence. This section of the Wikipedia page about gradient descent mentions the Barzilai-Borwein approach as a good way of updating eta. I implemented it and altered my code to update eta with it at each iteration, and the regression converged successfully. Below is my translation of the Wikipedia version of the formula to the variables used in the regression, as well as code that implements it:

$$\eta_t = \frac{\left(w_t - w_{t-1}\right)^T \left(\nabla RSS(w_t) - \nabla RSS(w_{t-1})\right)}{\left\| \nabla RSS(w_t) - \nabla RSS(w_{t-1}) \right\|^2}$$

Again, this code is called in the loop of my original ols_gradient_descent to update eta.
def eta_t(w_t, w_t_minus_1, grad_t, grad_t_minus_1):
    delta_w = w_t - w_t_minus_1
    delta_grad = grad_t - grad_t_minus_1
    eta_t = (delta_w.T @ delta_grad) / (LA.norm(delta_grad)) ** 2
    return eta_t
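For context, here is roughly how this slots into the loop of ols_gradient_descent; this is a sketch of the change rather than the verbatim final code (falling back to the initial eta on the first pass is an implementation detail):

# Before the loop: no previous iterate yet
w_prev, grad_prev = None, None
while True:
    t = t + 1
    grad = grad_rss(df, var_name_y, var_names_h, w)
    norm_grad = LA.norm(grad)
    if norm_grad < epsilon:
        break
    print("{} : {}".format(t, norm_grad))
    if grad_prev is not None:
        # Barzilai-Borwein step size from the previous iterate and gradient
        eta = eta_t(w, w_prev, grad, grad_prev)
    w_prev, grad_prev = w, grad
    w = w - eta * grad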
Try decreasing the value of eta. Gradient descent can diverge if eta is too high.
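With a fixed step size, the update is $w \leftarrow (I - 2\eta H^T H)\,w + 2\eta H^T y$, which is stable only when $\eta < 1/\lambda_{\max}(H^T H)$. A back-of-the-envelope check (this sketch assumes the imports and df from the question) shows why eta = 0.05 is far too large for this data:

import numpy as np
import numpy.linalg as LA

# The fixed-step iteration is stable only for eta < 1 / lambda_max(H'H);
# anything larger makes the iterates oscillate with growing amplitude.
H = np.column_stack([np.ones(len(df)), df["x1"], df["x2"]])
lam_max = LA.eigvalsh(H.T @ H).max()
print(1.0 / lam_max)   # roughly 4.5e-05 here, so eta = 0.05 diverges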