
Self-Implementation of Gradient Descent Compared to SciPy Minimize


This is an assignment for a convex optimization class that I'm taking. The assignment is as follows:

Implement the gradient descent algorithm with backtracking line search to find the optimal step size. Your implementation will be compared to Python's scipy.optimize.minimize function.

The specific function to minimize is the least squares function. The error between the solution found by the Python library and your implementation must be smaller than 0.001.

I've made an implementation, but the error value hovers around 1, and I've been searching for ways to improve it without much luck. Here's the code that I wrote:

Gradient Descent + Backtracking Line Search Implementation

import numpy as np

# Gradient descent.
def min_gd(fun, x0, grad, args=()):
    alpha = 0.3
    beta = 0.8

    delta_x = -grad(x0, *args)
    t = backtracking_line_search(fun, x0, grad, delta_x, alpha, beta, args)
    x_new = x0 + (t * delta_x)

    if np.linalg.norm(x_new) ** 2 > np.linalg.norm(x0) ** 2:
        return min_gd(fun, x_new, grad, args)
    else:
        return x_new
    
# Backtracking line search: returns a step size that gives sufficient decrease.
def backtracking_line_search(fun, x, grad, delta_x, alpha, beta, args=()):
    t = 1
    derprod = grad(x, *args) @ delta_x

    while fun((x + (t * delta_x)), *args) > fun(x, *args) + (alpha * t * derprod):
        t *= beta

    return t
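
For reference, the while loop above keeps shrinking t by a factor of beta until the standard Armijo sufficient-decrease condition holds (here delta_x = -grad(x), and the usual parameter ranges are 0 < alpha < 0.5 and 0 < beta < 1):

    f(x + t \Delta x) \le f(x) + \alpha \, t \, \nabla f(x)^T \Delta x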

Other given functions

import numpy as np
from scipy.optimize import minimize
import gd

# Least Squares function
def LeastSquares(x, A, b):
    return np.linalg.norm(A @ x - b) ** 2

# Gradient of the least squares function.
def grad_LeastSquares(x, A, b):
    return 2 * ((A.T @ A) @ x - A.T @ b)
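
As a sanity check that the gradient matches the objective (not part of the assignment; the random test data below is made up for illustration), scipy.optimize.check_grad can compare the analytic gradient against finite differences:

import numpy as np
from scipy.optimize import check_grad

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))  # illustrative data, not the grader's
b = rng.standard_normal(20)
x = rng.standard_normal(5)

# Prints a value near 0 when grad_LeastSquares is consistent with LeastSquares.
print(check_grad(LeastSquares, grad_LeastSquares, x, A, b))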

The error between the two results is calculated as the L2-norm of the difference between the two solution vectors.
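
Put differently, the check I'm failing looks something like the following sketch (the A, b, and x0 here are placeholders I made up; the grader presumably uses its own data):

import numpy as np
from scipy.optimize import minimize
import gd

rng = np.random.default_rng(42)
A = rng.standard_normal((50, 10))  # placeholder problem data
b = rng.standard_normal(50)
x0 = np.zeros(10)

x_mine = gd.min_gd(LeastSquares, x0, grad_LeastSquares, args=(A, b))
res = minimize(LeastSquares, x0, args=(A, b), jac=grad_LeastSquares)

# The assignment requires this error to be below 0.001.
print(np.linalg.norm(x_mine - res.x))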

One idea I've come up with is that the tolerance check in my gradient descent function may be flawed. Right now I'm essentially just checking whether the norm of the next iterate is larger than that of the previous one. However, I'm having trouble wrapping my head around how to improve it.

Any feedback is appreciated.

Edit

In case anyone is curious on the final code that I wrote to make it work in the desired way:

def min_gd(fun, x0, grad, args=()):
    alpha = 0.3
    beta = 0.8

    delta_x = -grad(x0, *args)
    t = backtracking_line_search(fun, x0, grad, delta_x, alpha, beta, args)
    x_new = x0 + (t * delta_x)
    
    if np.linalg.norm(grad(x_new, *args)) < 0.01:
        return x_new
    else:
        return min_gd(fun, x_new, grad, args)

I simply fixed the conditional statement so that instead of comparing the norms of consecutive iterates, I check whether the norm of the gradient falls below a predetermined tolerance.

Hope this helps anyone in the future.


Solution

  • Your guess about the tolerance check is right: the norm of the current vector is not related to convergence. A typical stopping criterion is a small gradient norm, so min_gd should look like

    def min_gd(fun, x0, grad, args=()):
        alpha = 0.3
        beta = 0.8
        eps = 0.001
    
        x_new = x0
        delta_x = -grad(x0, *args)
        while np.linalg.norm(delta_x) > eps:
            t = backtracking_line_search(fun, x_new, grad, delta_x, alpha, beta, args)
            x_new = x_new + (t * delta_x)
            delta_x = -grad(x_new, *args)
    
        return x_new
    

    where eps is some small positive tolerance.
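
    A small gradient norm really does bound the error here: f(x) = ||Ax - b||^2 is strongly convex with constant m = 2 * lambda_min(A^T A) (assuming A has full column rank), and the standard strong-convexity bound from Boyd & Vandenberghe gives

        \| x - x^\star \|_2 \le \frac{2}{m} \, \| \nabla f(x) \|_2

    so choosing eps small enough relative to m is what ultimately guarantees the required 0.001 accuracy.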