machine-learning, neural-network, backpropagation, gradient-descent, supervised-learning

Decrementing the learning rate in the error backpropagation algorithm


This is a more or less general question. In my implementation of the backpropagation algorithm, I start from some "big" learning rate and then decrease it once I see the error start to grow instead of shrink. I can apply this rate decrease either after the error has already grown a bit (StateA), or just before it is about to grow (StateB, a kind of rollback to the previous "successful" state).

So the question is: which is better from a mathematical point of view? Or do I need to run two parallel tests, say, continue learning from StateA and from StateB, both with the reduced learning rate, and compare which one decreases the error faster?

BTW, I didn't try the approach from the last paragraph; it only popped into my mind while writing this question. In the current implementation of the algorithm I continue learning from StateA with the decreased learning rate, under the assumption that the decrease in the learning rate is small enough that I can still move back in the previous direction toward the global minimum if I have accidentally run into only a local minimum.
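
To make clear what I mean, here is a rough, self-contained sketch on a toy least-squares problem (not my actual network; names and defaults are made up): the rate is cut when the error grows, and the `rollback` flag switches between continuing from the worse state (StateA) and restoring the last "successful" weights first (StateB):

```python
import numpy as np

def train(X, y, lr=1.0, decay=0.5, epochs=100, rollback=False):
    """Gradient descent on 0.5 * mean((Xw - y)^2) with error-driven rate decrease."""
    w = np.zeros(X.shape[1])
    prev_error, prev_w = np.inf, w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)        # gradient of the squared error
        w -= lr * grad                            # plain gradient-descent step
        error = 0.5 * np.mean((X @ w - y) ** 2)
        if error > prev_error:                    # error started to grow
            lr *= decay                           # shrink the learning rate
            if rollback:                          # StateB: restore last good weights
                w = prev_w.copy()
        else:                                     # StateA just continues from here
            prev_error, prev_w = error, w.copy()
    return w
```

So comparing the two would mean running `train(X, y, rollback=False)` against `train(X, y, rollback=True)` and watching which error curve falls faster, which is exactly the parallel test I mention above.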


Solution

  • What you describe is one of a collection of techniques called Learning Rate Scheduling. Just so you know, there are more than two such techniques:

    • Predetermined piecewise constant learning rate
    • Performance scheduling (looks like the closest one to yours)
    • Exponential scheduling
    • Power scheduling
    • ...

    The exact performance of each one greatly depends on the optimizer (SGD, Momentum, NAG, RMSProp, Adam, ...) and the data manifold (i.e. the training data and objective function), but they have been studied with regard to deep learning problems. For example, I'd recommend this paper by Andrew Senior et al., which compared various techniques on a speech recognition task. The authors concluded that exponential scheduling performed the best. If you're interested in the math behind it, you should definitely take a look at their study.
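
    If it helps to see them side by side, here is a rough, self-contained sketch of how some of these schedules can be written in plain Python/NumPy (the function names and default parameters are mine, not from the paper):

    ```python
    import numpy as np

    def piecewise_constant_schedule(boundaries, values):
        """Predetermined piecewise-constant rate: values[i] is used until boundaries[i]."""
        def lr(step):
            return values[int(np.searchsorted(boundaries, step, side="right"))]
        return lr

    def exponential_schedule(lr0, decay_rate, decay_steps):
        """Exponential scheduling: lr(t) = lr0 * decay_rate ** (t / decay_steps)."""
        return lambda step: lr0 * decay_rate ** (step / decay_steps)

    def power_schedule(lr0, decay_steps, power=1.0):
        """Power scheduling: lr(t) = lr0 / (1 + t / decay_steps) ** power."""
        return lambda step: lr0 / (1.0 + step / decay_steps) ** power

    class PerformanceScheduler:
        """Performance scheduling: cut the rate when the monitored error stops improving."""
        def __init__(self, lr0, factor=0.5, patience=3):
            self.lr, self.factor, self.patience = lr0, factor, patience
            self.best, self.bad_epochs = float("inf"), 0

        def step(self, error):
            if error < self.best:                  # still improving: keep the rate
                self.best, self.bad_epochs = error, 0
            else:                                  # stalled for too long: decay
                self.bad_epochs += 1
                if self.bad_epochs >= self.patience:
                    self.lr *= self.factor
                    self.bad_epochs = 0
            return self.lr
    ```

    The first three are pure functions of the training step, so the whole rate curve is fixed in advance; the performance scheduler needs the validation (or training) error fed into `step()` once per epoch, which is essentially the behaviour you describe in your question.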