Tags: python, deep-learning, gradient-descent

Numeric Gradient Descent in Python


I wrote this code to perform gradient descent on a function f: R^n -> R.

Here f is the function, X0 is the starting point, and eta is the step size.

It essentially consists of two parts: the first obtains the gradient of the function, and the second iterates over x, subtracting the gradient.

The problem is that it often fails to converge on some functions, for example:

If we take f(X) = (X[0]-20)**4 + (X[1]-25)**4, the gradient descent does not converge to [20, 25].

Is there something I need to change or add?

def descenso_grad(f,X0,eta):

    def grad(f,X):
        import numpy as np
        def partial(g,k,X):
            h=1e-9
            Y=np.copy(X)
            X[k-1]=X[k-1]+h
            dp=(g(X)-g(Y))/h
            return dp
        grd=[]
        for i in np.arange(0,len(X)):
            ai=partial(f,i+1,X)
            grd.append(ai)
        return grd
    #iterations

    i=0
    while True:
        i=i+1
        X0=X0-eta*np.array(grad(f,X0))

        if np.linalg.norm(grad(f,X0))<10e-8 or i>400: break
    return X0

Solution

  • Your gradient descent implementation is a good basic one, but the gradient sometimes oscillates and explodes. First, let's be precise: your gradient descent does not always diverge. For some combinations of eta and X0, it actually converges.

    But first let me suggest a few edits to the code:

    • The import numpy as np statement should be at the top of your file, not inside a function. In general, import statements should be at the beginning of the code so that they are executed only once.
    • It is better not to nest functions but to separate them: you can write the partial function outside of the grad function, and the grad function outside of the descenso_grad function. It is easier for debugging.
    • I strongly recommend passing the learning rate (eta), the number of steps (steps), and the tolerance (set to 10e-8, i.e. 1e-7, in your code) as parameters to the descenso_grad function. This way you will be able to compare their influence on the result.

    Anyway, here is the implementation of your code I will use in this answer:

    import numpy as np
    
    def partial(g, k, X):
        # forward-difference estimate of the k-th partial derivative of g at X
        h = 1e-9
        Y = np.copy(X)               # unperturbed copy of X
        X[k - 1] = X[k - 1] + h      # perturb the k-th coordinate
        dp = (g(X) - g(Y)) / h
        return dp
    
    def grad(f, X):
        # numeric gradient: one partial derivative per coordinate
        grd = []
        for i in np.arange(0, len(X)):
            ai = partial(f, i + 1, X)
            grd.append(ai)
        return grd
    
    def descenso_grad(f, X0, eta, steps, tolerance=1e-7):
        # iterate until the gradient is small enough or the step budget is spent
        i = 0
        while True:
            i = i + 1
            X0 = X0 - eta * np.array(grad(f, X0))
            if np.linalg.norm(grad(f, X0)) < tolerance or i > steps:
                break
        return X0
    
    def f(X):
        return (X[0] - 20)**4 + (X[1] - 25)**4
    
    

    Now, about the convergence. I said that your implementation didn't always diverge. Indeed:

    X0 = [2, 30]
    eta = 0.001
    steps = 400
    xmin = descenso_grad(f, X0, eta, steps)
    print(xmin) 
    

    Will print [20.55359068 25.55258024]

    But:

    X0 = [2, 0]
    eta = 0.001
    steps = 400
    xmin = descenso_grad(f, X0, eta, steps)
    print(xmin)
    

    Will actually diverge to [ 2.42462695e+01 -3.54879793e+10]

    1) What happened

    Your gradient is actually oscillating around the y axis. Let's compute the gradient of f at X0 = [2, 0]:

    print(grad(f, X0))
    

    We get grad(f, X0) = [-23328.00067961216, -62500.01024454831], which is quite large, but it points in the right direction.
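
    As a sanity check, we can compare this with the exact analytic gradient, [4*(x-20)**3, 4*(y-25)**3]:

    # analytic partial derivatives of f at X0 = [2, 0]
    print(4 * (2 - 20)**3)   # -23328
    print(4 * (0 - 25)**3)   # -62500

    The numeric estimates match up to the finite-difference error.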

    Now let's compute the next step of gradient descent:

    eta = 0.001
    X1=X0-eta*np.array(grad(f,X0))
    print(X1)
    

    We get X1 = [25.32800068 62.50001025]. On the x axis we actually get closer to the minimum, but on the y axis the gradient descent jumped to the other side of the minimum and ended up even further from it. Indeed, X0[1] was at a distance of 25 from the minimum (X0[1] - Xmin[1] = -25), on its left, while X1[1] is now at a distance of 62.5 - 25 = 37.5, on its right. Since the curve drawn by f has a simple U shape along the y axis, the value taken by f at X1 will be higher than before (to simplify, we ignore the influence of the x coordinate).
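
    We can check the overshoot with a quick computation: the update along the y axis, eta * |df/dy|, is larger than twice the distance to the minimum, which is exactly the condition for landing further away on the other side:

    step_y = 0.001 * 62500.01024454831   # eta * |df/dy| = 62.5
    print(step_y > 2 * 25)               # True: the step exceeds twice the distance

    And since the gradient of a quartic grows with the distance to the minimum, each overshoot produces an even larger step at the next iteration.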

    If we look at the next steps, we can clearly see the exploding oscillations around the minimum:

    X0 = [2, 0]
    eta = 0.001
    steps = 10
    
    #record values taken by X[1] during gradient descent
    curve_y = [X0[1]]
    
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta * np.array(grad(f, X0))
        curve_y.append(X0[1])
    
        if np.linalg.norm(grad(f, X0)) < 10e-8 or i > steps: break
    
    print(curve_y)
    

    We get [0, 62.50001024554831, -148.43710232226067, 20719.6258707022, -35487979280.37413]. We can see that X[1] gets further and further from the minimum while oscillating around it.

    In order to illustrate this, let's assume that the value along the x axis is fixed and look only at what happens on the y axis. The picture shows in black the function's values at each step of the gradient descent (synthetic data, for illustration purposes only). The gradient descent takes us further from the minimum at each step because the update value is too large:

    [Figure: oscillations of the gradient descent around the minimum along the y axis]
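
    If you want to draw a similar picture yourself, here is a minimal matplotlib sketch using the curve_y values recorded above (the figure was made with synthetic data, so yours will not be identical):

    import matplotlib.pyplot as plt

    x_fixed = 2  # keep the x coordinate fixed for this 1-D illustration
    f_values = [f([x_fixed, y]) for y in curve_y]  # function value at each step
    plt.plot(f_values, 'k-o')   # black curve, as in the picture
    plt.yscale('log')           # the values explode, so a log scale helps
    plt.xlabel('gradient descent step')
    plt.ylabel('f value (x fixed)')
    plt.show()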

    Note that the gradient descent above stops after only 4 updates (5 recorded values) although we programmed 10 steps. This is because once the values taken by the function are too large, Python can no longer tell the difference between f(X) and f(X + h), so it computes a gradient equal to zero and the tolerance test ends the loop:

    h = 1e-9   # same step as in partial
    x = 24     # for the example
    y = -35487979280.37413
    z = f([x, y + h]) - f([x, y])
    print(z)
    

    We get 0.0. This is a floating-point precision issue; we will come back to it later.
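
    A quick way to see why is numpy's spacing function, which gives the gap between a float64 value and the next representable one:

    big = f([24, -35487979280.37413])   # about 1.6e42
    print(np.spacing(big))              # gap to the next float64, about 3e26
    # the change induced by h, roughly 4*(y-25)**3 * h, is only about 1.7e23,
    # smaller than that gap, so f(X + h) rounds to exactly f(X)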

    So, these oscillations are due to the combination of:

    • the very large value of the partial derivative along the y axis
    • a value of eta too large to compensate for the exploding gradient in the update

    If this is true, we might converge if we use a smaller learning rate. Let's check:

    X0 = [2, 0]
    # divide eta by 100
    eta = 0.0001
    steps = 400
    xmin = descenso_grad(f, X0, eta, steps)
    print(xmin)
    

    We will get [18.25061287 23.24796497]. We might need more steps, but we are converging this time!

    2) How to avoid that?

    A) In your specific case

    Since the function's shape is simple, with no local minima or saddle points, we can avoid this issue by simply clipping the gradient: we define a maximum absolute value for each component of the gradient:

    
    def grad_clipped(f, X, clip):
        # same as grad, but each partial derivative is clipped to [-clip, clip]
        grd = []
        for i in np.arange(0, len(X)):
            ai = partial(f, i + 1, X)
            if ai < 0:
                ai = max(ai, -clip)
            else:
                ai = min(ai, clip)
            grd.append(ai)
        return grd
    
    def descenso_grad_clipped(f, X0, eta, steps, clip=100, tolerance=10e-8):
        # iterations
        i = 0
        while True:
            i = i + 1
            X0 = X0 - eta * np.array(grad_clipped(f, X0, clip))
            if np.linalg.norm(grad_clipped(f, X0, clip)) < tolerance or i > steps:
                break
        return X0
    
    

    Let's test it using the diverging example:

    X0 = [2, 0]
    eta = 0.001
    steps = 400
    clip = 100
    # pass clip by keyword so it is not swallowed by the steps parameter
    xmin = descenso_grad_clipped(f, X0, eta, steps, clip=clip)
    print(xmin)
    

    This time we converge: [19.31583901 24.20307188]. Note that clipping can slow the process down, since the gradient descent takes smaller steps. Here we can get closer to the real minimum by increasing the number of steps.
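
    For instance (a sketch; I am not claiming an exact output, but it should land closer to [20, 25] than the 400-step run):

    X0 = [2, 0]
    eta = 0.001
    steps = 2000   # five times more steps than before
    xmin = descenso_grad_clipped(f, X0, eta, steps, clip=100)
    print(xmin)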

    Note that this technique also solves the numerical precision issue we faced when the function's value was too high.

    B) In general

    In general, gradient descent has many pitfalls to work around (exploding or vanishing gradients, saddle points, local minima...). Optimization algorithms like Adam, RMSprop, or Adagrad are designed to avoid them.
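
    To give a flavour of the idea, here is a minimal sketch of gradient descent with momentum, reusing the grad function from above (the beta value is my choice, set to the conventional 0.9; this is a sketch, not a tuned optimizer):

    def descenso_grad_momentum(f, X0, eta, steps, beta=0.9):
        X = np.array(X0, dtype=float)
        v = np.zeros_like(X)                  # velocity: running average of gradients
        for _ in range(steps):
            g = np.array(grad(f, X))
            v = beta * v + (1 - beta) * g     # smooth the gradient over time
            X = X - eta * v                   # update with the smoothed gradient
        return X

    The averaging damps the back-and-forth oscillations we saw above; RMSprop and Adam build on the same idea and additionally adapt the step size per coordinate.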

    I am not going to delve into the details, because that would deserve a whole article, but here are two resources you can use (I suggest reading them in the given order) to deepen your understanding of the topic: