machine-learning, linear-regression, derivative, differentiation, gradient-descent

Why is the derivative of a function used to calculate the local minimum instead of the actual function?


In a machine learning regression problem, why is the local minimum computed from the derivative of the function rather than from the function itself?

Example: http://en.wikipedia.org/wiki/Gradient_descent

The gradient descent algorithm is applied to find a local minimum of the function

$$f(x) = x^4 - 3x^3 + 2, \tag{A}$$

with derivative

$$f'(x) = 4x^3 - 9x^2. \tag{B}$$

Here, to find the local minimum of function (A) with the gradient descent algorithm, they use the derivative of (A), which is function (B).
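To see how (B) is actually used, here is a minimal Python sketch of gradient descent on (A), repeatedly updating x by subtracting gamma * f'(x). The starting point x = 4 and step size gamma = 0.01 are illustrative assumptions, not values taken from the linked example.

```python
# Minimal gradient-descent sketch for f(x) = x**4 - 3*x**3 + 2.
# The starting point and step size below are illustrative assumptions.

def f(x):
    return x**4 - 3 * x**3 + 2      # function (A)

def f_prime(x):
    return 4 * x**3 - 9 * x**2      # derivative (B)

x = 4.0           # assumed starting point
gamma = 0.01      # assumed fixed step size
tolerance = 1e-8

for _ in range(10_000):
    step = gamma * f_prime(x)       # the derivative gives the direction and size of the move
    x -= step
    if abs(step) < tolerance:       # stop once the gradient is effectively zero
        break

print(f"local minimum near x = {x:.4f}, f(x) = {f(x):.4f}")
# Converges to x = 9/4 = 2.25, the point where f'(x) = 0.
```

Note that the loop never evaluates f(x) itself: the derivative alone is enough to decide where to move and when to stop, which is exactly what the question is asking about.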


Solution

  • The reason is that if the function is convex (or concave, if you're doing maximisation; the two problems are equivalent), you know there is a single minimum (maximum), and therefore a single point where the gradient is equal to zero. There are techniques that use only the function values themselves, but if you can compute the gradient you converge much faster, because the gradient tells you the direction in which the function decreases and gives you a sense of how far you are from the optimum; the minimum is reached exactly where the gradient vanishes (the first-order condition for the example function is worked out just after this answer).

    As well as gradient descent, there's an optimisation method known as Newton's method, which requires computing the second derivative (the Hessian, in multivariate optimisation). It converges even faster, but requires you to be able to invert the Hessian, which is not feasible if you have a lot of parameters. There are methods that get around this by computing a limited-memory approximation of the Hessian. They converge faster than plain gradient descent because they use information about the curvature of the function: it's a simple trade-off, where the more you know about the function you're trying to optimise, the faster you can find the solution.
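To make the zero-gradient point concrete with the example from the question, the candidate optima of (A) are exactly the zeros of the derivative (B):

$$
f'(x) = 4x^3 - 9x^2 = x^2(4x - 9) = 0
\quad\Longrightarrow\quad x = 0 \ \text{or} \ x = \tfrac{9}{4},
$$

$$
f''(x) = 12x^2 - 18x, \qquad f''\!\left(\tfrac{9}{4}\right) = \tfrac{81}{4} > 0,
$$

so $x = 9/4$ is the local minimum, while $x = 0$ (where $f'' = 0$ and $f(x) \approx 2 - 3x^3$) is an inflection point rather than a minimum. Gradient descent works with $f'$ precisely because this zero-gradient condition characterises the optimum.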
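For comparison, here is a minimal sketch of Newton's method on the same one-dimensional example: it looks for a zero of f'(x) using the second derivative f''(x) = 12x^2 - 18x (the scalar analogue of the Hessian). The starting point is an illustrative assumption; a starting point close to the other critical point at x = 0 could instead send the iteration there.

```python
# Newton's-method sketch on the same example: find a zero of f'(x)
# using the curvature information in f''(x).
# The starting point below is an illustrative assumption.

def f_prime(x):
    return 4 * x**3 - 9 * x**2      # derivative (B)

def f_double_prime(x):
    return 12 * x**2 - 18 * x       # second derivative of (A)

x = 4.0      # assumed starting point
for _ in range(50):
    step = f_prime(x) / f_double_prime(x)   # curvature-scaled step
    x -= step
    if abs(step) < 1e-12:
        break

print(f"Newton's method reaches x = {x:.6f} in only a few iterations")
# Quadratic convergence to x = 9/4 = 2.25, compared with the
# geometric convergence of plain gradient descent above.
```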