I wrote this code to run gradient descent on a vector function.
Here, f is the function, X0 is the starting point and eta is the step size.
It is essentially made up of two parts: the first obtains the gradient of the function, and the second iterates over x, subtracting the gradient.
The problem is that it usually has trouble converging on some functions. For example:
if we take f(X) = (X[0]-20)**4 + (X[1]-25)**4, the gradient descent does not converge to [20, 25].
Is there something I need to change or add?
def descenso_grad(f, X0, eta):
    def grad(f, X):
        import numpy as np
        def partial(g, k, X):
            h = 1e-9
            Y = np.copy(X)
            X[k-1] = X[k-1] + h
            dp = (g(X) - g(Y)) / h
            return dp
        grd = []
        for i in np.arange(0, len(X)):
            ai = partial(f, i+1, X)
            grd.append(ai)
        return grd
    # iterations
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta * np.array(grad(f, X0))
        if np.linalg.norm(grad(f, X0)) < 10e-8 or i > 400: break
    return X0
Your gradient descent implementation is a good basic one, but the gradient sometimes oscillates and explodes. First, let's be precise: your gradient descent does not always diverge. For some combinations of eta and X0, it actually converges.
But first let me suggest a few edits to the code:
- The import numpy as np statement should be at the top of your file, not within a function. In general, any import statement should be at the beginning of the code so that it is executed only once.
- Define the partial function outside of the grad function, and the grad function outside of the descenso_grad function. It is better for debugging.
- Pass the learning rate (eta), the number of steps (steps) and the tolerance (set to 10e-8 in your code; 1e-7 also works) as parameters to the descenso_grad function. This way you will be able to compare their influence on the result.
Anyway, here is the implementation of your code I will use in this answer:
import numpy as np

def partial(g, k, X):
    h = 1e-9
    Y = np.copy(X)
    X[k - 1] = X[k - 1] + h
    dp = (g(X) - g(Y)) / h
    return dp

def grad(f, X):
    grd = []
    for i in np.arange(0, len(X)):
        ai = partial(f, i + 1, X)
        grd.append(ai)
    return grd

def descenso_grad(f, X0, eta, steps, tolerance=1e-7):
    # iterations
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta * np.array(grad(f, X0))
        if np.linalg.norm(grad(f, X0)) < tolerance or i > steps: break
    return X0

def f(X):
    return (X[0] - 20)**4 + (X[1] - 25)**4
Now, about the convergence. I said that your implementation didn't always diverge. Indeed:
X0 = [2, 30]
eta = 0.001
steps = 400
xmin = descenso_grad(f, X0, eta, steps)
print(xmin)
Will print [20.55359068 25.55258024]
But:
X0 = [2, 0]
eta = 0.001
steps = 400
xmin = descenso_grad(f, X0, eta, steps)
print(xmin)
Will actually diverge to [ 2.42462695e+01 -3.54879793e+10]
Your gradient is actually oscillating around the y axis. Let's compute the gradient of f at X0 = [2, 0]:
print(grad(f, X0))
We get grad(f, X0) = [-23328.00067961216, -62500.01024454831], which is quite large but points in the right direction.
Now let's compute the next step of gradient descent:
eta = 0.001
X1=X0-eta*np.array(grad(f,X0))
print(X1)
We get X1 = [25.32800068 62.50001025]. We can see that on the x axis, we actually get closer to the minimum, but on the y axis, the gradient descent jumped to the other side of the minimum and moved even further from it. Indeed, X0[1] was at a distance of 25 from the minimum (Xmin[1] - X0[1] = 25) on its left, while X1[1] is now at a distance of 62.5 - 25 = 37.5, but on its right. Since the curve drawn by f has a simple U shape along the y axis, the value taken by f at X1 will be higher than before (to simplify, we ignore the influence of the x coordinate).
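We can check this overshoot numerically. Below is a quick standalone sketch (not part of the original code) that reuses the f defined above and the approximate X1 computed from the first step:

```python
import numpy as np

def f(X):
    return (X[0] - 20)**4 + (X[1] - 25)**4

X0 = np.array([2.0, 0.0])
X1 = np.array([25.328, 62.5])  # approximately X0 - eta * grad(f, X0)

print(f(X0))  # 495601.0
print(f(X1))  # roughly 1.98e6: the overshoot on the y axis raised f
```

Even though the x coordinate improved, the jump on the y axis made the function value roughly four times larger.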
If we look at the next steps, we can clearly see the exploding oscillations around the minimum:
X0 = [2, 0]
eta = 0.001
steps = 10

# record values taken by X[1] during gradient descent
curve_y = [X0[1]]
i = 0
while True:
    i = i + 1
    X0 = X0 - eta * np.array(grad(f, X0))
    curve_y.append(X0[1])
    if np.linalg.norm(grad(f, X0)) < 10e-8 or i > steps: break
print(curve_y)
We get [0, 62.50001024554831, -148.43710232226067, 20719.6258707022, -35487979280.37413]. We can see that X[1] gets further and further from the minimum while oscillating around it.
In order to illustrate this, let's assume that the value along the x axis is fixed and look only at what happens on the y axis. The picture shows in black the oscillations of the function's values taken at each step of the gradient descent (synthetic data, for illustration purposes only). The gradient descent takes us further from the minimum at each step because the update value is too large:
Note that the gradient descent we gave as an example makes only 5 steps while we programmed 10. This is because when the values taken by the function are too high, Python cannot tell the difference between f(X[1]) and f(X[1] + h), so it computes a gradient equal to zero:
h = 1e-9
x = 24  # for the example
y = -35487979280.37413
z = f([x, y + h]) - f([x, y])
print(z)
We get 0.0. This issue is about floating-point precision; we will get back to this later.
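We can make this precision argument concrete. The sketch below (my own check, not from the original code) compares the change f should undergo over a step of h, computed from the analytic derivative, with the gap between consecutive float64 values at that magnitude:

```python
import numpy as np

y = -35487979280.37413   # the diverged y value from above
h = 1e-9
val = (y - 25)**4                       # the y-term of f here, around 1.6e42
true_change = abs(4 * (y - 25)**3 * h)  # what f(y + h) - f(y) should be
gap = np.spacing(val)                   # distance to the next representable float64

# The analytic change is far smaller than the representable gap,
# so the finite difference rounds to exactly zero.
print(true_change < gap)  # True
```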
So, these oscillations are due to the combination of:
- the very large gradient values produced by the quartic shape of the function far from its minimum, and
- a value of eta too large to compensate for the exploding gradient in the update.
If this is true, we might converge if we use a smaller learning rate. Let's check:
X0 = [2, 0]
# divide eta by 10
eta = 0.0001
steps = 400
xmin = descenso_grad(f, X0, eta, steps)
print(xmin)
We will get [18.25061287 23.24796497]
. We might need more steps but we are converging this time!!
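We can even estimate when divergence happens. Ignoring the x coordinate and using the analytic derivative of (y - 25)**4 (a back-of-the-envelope sketch, not part of the original code), one step maps the distance d = y - 25 to d * (1 - 4 * eta * d**2), which grows in magnitude exactly when 4 * eta * d**2 > 2:

```python
import numpy as np

eta = 0.001
# Oscillating divergence starts when |d| exceeds sqrt(1 / (2 * eta)).
threshold = np.sqrt(1 / (2 * eta))
print(threshold)  # about 22.36

d = -25.0  # X0 = [2, 0]: y starts 25 below the minimum -> unstable
d_next = d * (1 - 4 * eta * d**2)
print(d_next)  # 37.5, matching X1[1] = 25 + 37.5 = 62.5

d = 5.0    # X0 = [2, 30]: y starts 5 above the minimum -> stable
print(d * (1 - 4 * eta * d**2))  # about 4.5: the distance shrinks
```

This matches both experiments: with eta = 0.001 the y distance of 25 is past the threshold and oscillates outward, while dividing eta by 10 raises the threshold to about 70.7, so both starting points converge.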
Since the function's shape is simple, with no local minima or saddle points, we can avoid this issue by simply clipping the gradient values. This means that we define a maximum absolute value for each component of the gradient:
def grad_clipped(f, X, clip):
    grd = []
    for i in np.arange(0, len(X)):
        ai = partial(f, i + 1, X)
        if ai < 0:
            ai = max(ai, -1 * clip)
        else:
            ai = min(ai, clip)
        grd.append(ai)
    return grd

def descenso_grad_clipped(f, X0, eta, steps, clip=100, tolerance=10e-8):
    # iterations
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta * np.array(grad_clipped(f, X0, clip))
        if np.linalg.norm(grad_clipped(f, X0, clip)) < tolerance or i > steps: break
    return X0
Let's test it using the diverging example:
X0 = [2, 0]
eta = 0.001
steps = 400
clip=100
xmin = descenso_grad_clipped(f, X0, eta, steps, clip)
print(xmin)
This time we are converging: [19.31583901 24.20307188]. Note that this can slow down the process, since gradient descent will take smaller steps. Here we can get closer to the real minimum by increasing the number of steps.
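To illustrate that last point, here is a quick experiment (the same functions as above, bundled together so it runs standalone) with a much larger step budget; the exact digits may vary, but the result lands much closer to [20, 25] than with 400 steps:

```python
import numpy as np

def partial(g, k, X):
    h = 1e-9
    Y = np.copy(X)
    X[k - 1] = X[k - 1] + h
    return (g(X) - g(Y)) / h

def grad_clipped(f, X, clip):
    grd = []
    for i in np.arange(0, len(X)):
        ai = partial(f, i + 1, X)
        ai = max(ai, -clip) if ai < 0 else min(ai, clip)
        grd.append(ai)
    return grd

def descenso_grad_clipped(f, X0, eta, steps, clip=100, tolerance=10e-8):
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta * np.array(grad_clipped(f, X0, clip))
        if np.linalg.norm(grad_clipped(f, X0, clip)) < tolerance or i > steps: break
    return X0

def f(X):
    return (X[0] - 20)**4 + (X[1] - 25)**4

# Same diverging start, but 10000 steps instead of 400
xmin = descenso_grad_clipped(f, [2, 0], eta=0.001, steps=10000, clip=100)
print(xmin)  # much closer to [20, 25]
```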
Note that this technique also solves the numerical precision issue we faced when the function's value was too high.
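As a side note, grad_clipped above bounds each component separately. Another common variant clips the gradient's Euclidean norm, which preserves the gradient's direction. Here is a minimal sketch of that alternative (the function and parameter names are my own):

```python
import numpy as np

def clip_by_norm(g, max_norm):
    # Rescale g so that its Euclidean norm is at most max_norm,
    # keeping its direction unchanged.
    g = np.asarray(g, dtype=float)
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g = [-23328.0, -62500.0]        # roughly the gradient at X0 = [2, 0]
clipped = clip_by_norm(g, 100.0)
print(np.linalg.norm(clipped))  # about 100.0, direction preserved
```

Component-wise clipping can distort the update's direction when one component is clipped and the other is not; norm clipping avoids that at the cost of one extra norm computation per step.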
In general, there are a lot of pitfalls that gradient descent algorithms try to avoid (exploding or vanishing gradients, saddle points, local minima...). Optimization algorithms like Adam, RMSprop and Adagrad try to avoid these pitfalls.
I am not going to delve into the details because that would deserve a whole article; however, here are two resources you can use (I suggest reading them in the order given) to deepen your understanding of the topic: