Tags: python, deep-learning, gradient-descent

Numeric Gradient Descent in Python


I wrote this code to perform gradient descent on a function f: R^n -> R.

Here f is the function, X0 is the starting point, and eta is the step size.

It essentially consists of two parts: the first obtains the gradient of the function, and the second iterates over x, subtracting the gradient.

The problem is that it often fails to converge on some functions, for example:

If we take f(X) = (X[0]-20)**4 + (X[1]-25)**4, the gradient descent does not converge to [20, 25].

Is there something I need to change or add?

def descenso_grad(f,X0,eta):

    def grad(f,X):
        import numpy as np
        def partial(g,k,X):
            h=1e-9
            Y=np.copy(X)
            X[k-1]=X[k-1]+h
            dp=(g(X)-g(Y))/h
            return dp
        grd=[]
        for i in np.arange(0,len(X)):
            ai=partial(f,i+1,X)
            grd.append(ai)
        return grd
    #iterations

    i=0
    while True:
        i=i+1
        X0=X0-eta*np.array(grad(f,X0))

        if np.linalg.norm(grad(f,X0))<10e-8 or i>400: break
    return X0

Solution

  • Your gradient descent implementation is a good basic one, but the gradient sometimes oscillates and explodes. First, let's be precise: your gradient descent does not always diverge. For some combinations of eta and X0, it actually converges.

    But first let me suggest a few edits to the code:

    • The import numpy as np statement should be at the top of your file, not inside a function. In general, import statements should be at the beginning of the code so that they are executed only once.
    • It is better not to nest functions but to separate them: you can write the partial function outside of the grad function, and the grad function outside of the descenso_grad function. It is easier for debugging.
    • I strongly recommend passing the learning rate (eta), the number of steps (steps), and the tolerance (set to 10e-8, i.e. 1e-7, in your code) as parameters to the descenso_grad function. This way you will be able to compare their influence on the result.

    Anyway, here is the implementation of your code I will use in this answer:

    import numpy as np
    
    def partial(g, k, X):
        # forward-difference estimate of the k-th partial derivative of g at X
        h = 1e-9
        Y = np.copy(X)               # unperturbed copy of X
        X[k - 1] = X[k - 1] + h      # perturb the k-th coordinate
        dp = (g(X) - g(Y)) / h
        return dp
    
    def grad(f, X):
        # numeric gradient: one partial derivative per coordinate
        grd = []
        for i in np.arange(0, len(X)):
            ai = partial(f, i + 1, X)
            grd.append(ai)
        return grd
    
    def descenso_grad(f, X0, eta, steps, tolerance=1e-7):
        # iterate until the gradient is small enough or the step budget is spent
        i = 0
        while True:
            i = i + 1
            X0 = X0 - eta * np.array(grad(f, X0))
            if np.linalg.norm(grad(f, X0)) < tolerance or i > steps:
                break
        return X0
    
    def f(X):
        return (X[0] - 20)**4 + (X[1] - 25)**4
    
    

    Now, about the convergence. I said that your implementation didn't always diverge. Indeed:

    X0 = [2, 30]
    eta = 0.001
    steps = 400
    xmin = descenso_grad(f, X0, eta, steps)
    print(xmin) 
    

    Will print [20.55359068 25.55258024]

    But:

    X0 = [2, 0]
    eta = 0.001
    steps = 400
    xmin = descenso_grad(f, X0, eta, steps)
    print(xmin)
    

    Will actually diverge to [ 2.42462695e+01 -3.54879793e+10]

    1) What happened

    Your gradient is actually oscillating around the y axis. Let's compute the gradient of f at X0 = [2, 0]:

    print(grad(f, X0))
    

    We get grad(f, X0) = [-23328.00067961216, -62500.01024454831], which is quite large, but it points in the right direction.
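
    As a sanity check, we can compare this with the exact analytic gradient, [4*(x-20)**3, 4*(y-25)**3]:

    # analytic partial derivatives of f at X0 = [2, 0]
    print(4 * (2 - 20)**3)   # -23328
    print(4 * (0 - 25)**3)   # -62500

    The numeric estimates match up to the finite-difference error.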

    Now let's compute the next step of gradient descent:

    eta = 0.001
    X1=X0-eta*np.array(grad(f,X0))
    print(X1)
    

    We get X1 = [25.32800068 62.50001025]. On the x axis we actually get closer to the minimum, but on the y axis the gradient descent jumped to the other side of the minimum and ended up even further from it. Indeed, X0[1] was at a distance of 25 from the minimum (X0[1] - Xmin[1] = -25), on its left, while X1[1] is now at a distance of 62.5 - 25 = 37.5, on its right. Since the curve drawn by f has a simple U shape along the y axis, the value taken by f at X1 will be higher than before (to simplify, we ignore the influence of the x coordinate).
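
    We can check the overshoot with a quick computation: the update along the y axis, eta * |df/dy|, is larger than twice the distance to the minimum, which is exactly the condition for landing further away on the other side:

    step_y = 0.001 * 62500.01024454831   # eta * |df/dy| = 62.5
    print(step_y > 2 * 25)               # True: the step exceeds twice the distance

    And since the gradient of a quartic grows with the distance to the minimum, each overshoot produces an even larger step at the next iteration.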

    If we look at the next steps, we can clearly see the exploding oscillations around the minimum:

    X0 = [2, 0]
    eta = 0.001
    steps = 10
    
    #record values taken by X[1] during gradient descent
    curve_y = [X0[1]]
    
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta * np.array(grad(f, X0))
        curve_y.append(X0[1])
    
        if np.linalg.norm(grad(f, X0)) < 10e-8 or i > steps: break
    
    print(curve_y)
    

    We get [0, 62.50001024554831, -148.43710232226067, 20719.6258707022, -35487979280.37413]. We can see that X[1] gets further and further from the minimum while oscillating around it.

    In order to illustrate this, let's assume that the value along the x axis is fixed and look only at what happens on the y axis. The picture shows in black the function's values at each step of the gradient descent (synthetic data, for illustration purposes only). The gradient descent takes us further from the minimum at each step because the update value is too large:

    [Figure: oscillations of the gradient descent around the minimum along the y axis]
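
    If you want to draw a similar picture yourself, here is a minimal matplotlib sketch using the curve_y values recorded above (the figure was made with synthetic data, so yours will not be identical):

    import matplotlib.pyplot as plt

    x_fixed = 2  # keep the x coordinate fixed for this 1-D illustration
    f_values = [f([x_fixed, y]) for y in curve_y]  # function value at each step
    plt.plot(f_values, 'k-o')   # black curve, as in the picture
    plt.yscale('log')           # the values explode, so a log scale helps
    plt.xlabel('gradient descent step')
    plt.ylabel('f value (x fixed)')
    plt.show()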

    Note that the gradient descent above stops after only 4 updates (5 recorded values) although we programmed 10 steps. This is because once the values taken by the function are too large, Python can no longer tell the difference between f(X) and f(X + h), so it computes a gradient equal to zero and the tolerance test ends the loop:

    h = 1e-9   # same step as in partial
    x = 24     # for the example
    y = -35487979280.37413
    z = f([x, y + h]) - f([x, y])
    print(z)
    

    We get 0.0. This is a floating-point precision issue; we will come back to it later.
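
    A quick way to see why is numpy's spacing function, which gives the gap between a float64 value and the next representable one:

    big = f([24, -35487979280.37413])   # about 1.6e42
    print(np.spacing(big))              # gap to the next float64, about 3e26
    # the change induced by h, roughly 4*(y-25)**3 * h, is only about 1.7e23,
    # smaller than that gap, so f(X + h) rounds to exactly f(X)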

    So, these oscillations are due to the combination of:

    • the very large value of the partial derivative along the y axis
    • a value of eta too large to compensate for the exploding gradient in the update

    If this is true, we might converge if we use a smaller learning rate. Let's check:

    X0 = [2, 0]
    # divide eta by 100
    eta = 0.0001
    steps = 400
    xmin = descenso_grad(f, X0, eta, steps)
    print(xmin)
    

    We will get [18.25061287 23.24796497]. We might need more steps, but we are converging this time!

    2) How to avoid that?

    A) In your specific case

    Since the function's shape is simple, with no local minima or saddle points, we can avoid this issue by simply clipping the gradient: we define a maximum absolute value for each component of the gradient:

    
    def grad_clipped(f, X, clip):
        # same as grad, but each partial derivative is clipped to [-clip, clip]
        grd = []
        for i in np.arange(0, len(X)):
            ai = partial(f, i + 1, X)
            if ai < 0:
                ai = max(ai, -clip)
            else:
                ai = min(ai, clip)
            grd.append(ai)
        return grd
    
    def descenso_grad_clipped(f, X0, eta, steps, clip=100, tolerance=10e-8):
        # iterations
        i = 0
        while True:
            i = i + 1
            X0 = X0 - eta * np.array(grad_clipped(f, X0, clip))
            if np.linalg.norm(grad_clipped(f, X0, clip)) < tolerance or i > steps:
                break
        return X0
    
    

    Let's test it using the diverging example:

    X0 = [2, 0]
    eta = 0.001
    steps = 400
    clip = 100
    # pass clip by keyword so it is not swallowed by the steps parameter
    xmin = descenso_grad_clipped(f, X0, eta, steps, clip=clip)
    print(xmin)
    

    This time we converge: [19.31583901 24.20307188]. Note that clipping can slow the process down, since the gradient descent takes smaller steps. Here we can get closer to the real minimum by increasing the number of steps.
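
    For instance (a sketch; I am not claiming an exact output, but it should land closer to [20, 25] than the 400-step run):

    X0 = [2, 0]
    eta = 0.001
    steps = 2000   # five times more steps than before
    xmin = descenso_grad_clipped(f, X0, eta, steps, clip=100)
    print(xmin)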

    Note that this technique also solves the numerical precision issue we faced when the function's value was too high.

    B) In general

    In general, gradient descent has many pitfalls to work around (exploding or vanishing gradients, saddle points, local minima...). Optimization algorithms like Adam, RMSprop, or Adagrad are designed to avoid them.
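
    To give a flavour of the idea, here is a minimal sketch of gradient descent with momentum, reusing the grad function from above (the beta value is my choice, set to the conventional 0.9; this is a sketch, not a tuned optimizer):

    def descenso_grad_momentum(f, X0, eta, steps, beta=0.9):
        X = np.array(X0, dtype=float)
        v = np.zeros_like(X)                  # velocity: running average of gradients
        for _ in range(steps):
            g = np.array(grad(f, X))
            v = beta * v + (1 - beta) * g     # smooth the gradient over time
            X = X - eta * v                   # update with the smoothed gradient
        return X

    The averaging damps the back-and-forth oscillations we saw above; RMSprop and Adam build on the same idea and additionally adapt the step size per coordinate.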

    I am not going to delve into the details, because that would deserve a whole article, but here are two resources you can use (I suggest reading them in the given order) to deepen your understanding of the topic: