I need to optimize a scalar function in a high-dimensional space. The function varies rapidly with its arguments, so the gradients are large in absolute value. The optimizers in scipy.optimize.minimize
fail because the minimization procedure takes steps that are too large. The following code illustrates the problem using a simple quadratic function.
from scipy.optimize import minimize

def objective(x, scalar=1):
    """
    Quadratic objective function with optional scalar.
    """
    # Report function call for debugging
    print("objective({}, scalar={})".format(x, scalar))
    # Return function value and gradient
    return x ** 2 * scalar, 2 * x * scalar

# This optimisation succeeds
print(minimize(objective, 1, jac=True))

# This optimisation fails
print(minimize(objective, 1, (1e8,), jac=True))
Of course, I can manually rescale the value and gradient of the function of interest, but I was wondering whether there is a recommended approach for minimizing such functions, e.g. specifying a learning rate.
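For concreteness, the manual rescaling I have in mind is just wrapping the objective and dividing both the value and the gradient by a constant of roughly the right magnitude. The sketch below reuses objective and minimize from the snippet above; the wrapper name and the choice of 1e8 as the scale are only illustrative.

import numpy as np

def rescaled_objective(x, scalar=1, scale=1e8):
    """
    Same quadratic objective, but with value and gradient divided by a constant.
    """
    value, grad = objective(x, scalar)
    return value / scale, np.asarray(grad) / scale

# With the rescaled function the default step sizes are reasonable again
print(minimize(rescaled_objective, 1, (1e8,), jac=True))

This works, but it requires me to know the scale in advance and to repeat it for every function I want to minimize, hence the question about a built-in option such as a learning rate.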
For large non-linear optimization problems, one would typically pay attention to (at least) four things:

1. Scaling. Some more advanced solvers provide support for automatic scaling, but scaling a non-linear problem is not easy because the Jacobian changes over the course of the iterations (typical strategies are: scale only the linear part, scale the linear and non-linear parts once at the beginning based on the initial values, or rescale the problem during the iteration process). Linear solvers have an easier job in this respect, since the Jacobian is constant and can be scaled once at the beginning. scipy.optimize.minimize is not the most advanced solver, so I would encourage you to scale things yourself. Typically you can only do this once before the solver starts; in some cases you can even stop the solver to rescale and then call it again using the last point as the initial value (see the sketch below). This sounds crazy, but the trick has helped me a few times.

2. A good initial point, which keeps the solver in reasonable regions where functions and gradients can be reliably evaluated.

3. Good bounds, for the same reason.

4. Model reformulations that provide better scaling (replace division by multiplication, take logs, etc.).
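To make the self-scaling and restart advice a bit more concrete, here is a rough sketch against the quadratic objective from the question. The helper name, the number of restart rounds and the gradient-based scale estimate are only illustrative choices, not a general recipe.

import numpy as np
from scipy.optimize import minimize

def solve_with_rescaling(fun, x0, args=(), rounds=3):
    """
    Scale the objective by the gradient magnitude at the current point,
    run the solver, then restart from the last point with a fresh scale.
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    result = None
    for _ in range(rounds):
        # Estimate a scale factor from the gradient at the current point
        _, grad = fun(x, *args)
        scale = max(float(np.max(np.abs(grad))), 1.0)

        def scaled(xk, *fargs, _scale=scale):
            value, gradient = fun(xk, *fargs)
            return value / _scale, np.asarray(gradient) / _scale

        result = minimize(scaled, x, args, jac=True)
        x = result.x  # the last point becomes the initial value for the next round
    return result

print(solve_with_rescaling(objective, 1.0, (1e8,)))

Dividing both the value and the gradient by the same positive constant does not move the minimizer; it only brings the gradients back to order one so that the solver's default step sizes become reasonable.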