neural-network, gradient-descent

Is it possible that our algorithm converges to different local minima if we train on the same data twice, with a different random initialization of the parameters each time?


Suppose we train a neural network with gradient descent on the same data twice, randomly initializing the parameters each time. Is it possible that the algorithm converges to different local minima?


Solution

  • Yes. Gradient descent, as the name implies, goes "downhill" with respect to the loss function. But simply going downhill does not mean you will reach the lowest valley.

    Consider this example with two local minima.

    [Figure: a loss function with two local minima, A and C, separated by a point b]

    If the randomly initialized parameters lead to initial outputs near A, to the left of b, then gradient descent will go downhill toward A. But if the initial parameters lead to outputs to the right of b, closer to C, then the downhill direction is toward C.

    Gradient descent will just go downhill. Which way that is, and where you might end up, depend a great deal on where you start, as the sketch below illustrates.
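
A minimal sketch of this behavior. It uses a made-up one-dimensional function with two local minima as a stand-in for a real network's loss surface; the function, learning rate, and starting points are illustrative assumptions, not anything from the question.

```python
# Toy stand-in for a loss surface: f(x) = (x^2 - 1)^2 + 0.3x has a deeper
# local minimum near x = -1.04, a shallower one near x = +0.96, and a
# local maximum near x = 0.08 (playing the role of "b" in the figure above).
def loss(x):
    return (x**2 - 1)**2 + 0.3 * x

def grad(x):
    # Derivative of the loss: d/dx [(x^2 - 1)^2 + 0.3x]
    return 4 * x * (x**2 - 1) + 0.3

def gradient_descent(x0, lr=0.02, steps=500):
    """Plain gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Two different "random initializations", hardcoded here for reproducibility:
# one on each side of the local maximum.
for x0 in (-1.5, +1.5):
    x_star = gradient_descent(x0)
    print(f"start at {x0:+.2f} -> converged to x = {x_star:+.2f}, "
          f"loss = {loss(x_star):+.2f}")

# Approximate output:
#   start at -1.50 -> converged to x = -1.04, loss = -0.31
#   start at +1.50 -> converged to x = +0.96, loss = +0.29
```

Both runs use the identical update rule on the identical function; only the starting point differs, and that alone decides which valley each run settles into.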