I am writing a simple gradient descent implementation of linear regression for a multivariate data set. While testing the code, I noticed that the cost was still decreasing after 5 million iterations, which suggests that my learning rate is too small. When I tried to increase it, the cost value overflowed. After I normalized the data, the problem went away and I could increase the learning rate without getting any error. What is the relation between normalization and overflow of the cost?
gradient descent without normalization (small learning rate)
data without normalization (bigger learning rate)
Basically, normalization of the inputs gives the surface of the function you want to optimize a more spherical shape. Without this normalization, differences in the scale of the variables may cause the surface to be more ellipsoidal.
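To put a rough number on "spherical vs. ellipsoidal", here is a minimal sketch that compares the condition number of the Hessian of the squared-error cost before and after standardizing the inputs. The data is made up for illustration (two hypothetical features, `size` and `rooms`, on very different scales), and `hessian` is just a helper name I chose. A condition number close to 1 corresponds to nearly spherical contours; a very large one corresponds to a long, narrow valley.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200

# Hypothetical features on very different scales (not from the question).
size = rng.uniform(500.0, 3500.0, m)   # e.g. a size in square feet
rooms = rng.uniform(1.0, 6.0, m)       # e.g. a number of rooms
X_raw = np.column_stack([size, rooms])

def hessian(X):
    """Hessian of J(w) = ||A w - y||^2 / (2m), where A = [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])
    return A.T @ A / len(X)

# Standardize: zero mean and unit variance per feature.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

cond_raw = np.linalg.cond(hessian(X_raw))
cond_std = np.linalg.cond(hessian(X_std))

print(f"condition number, raw features:          {cond_raw:.3g}")
print(f"condition number, standardized features: {cond_std:.3g}")
```

For a linear regression cost the Hessian does not depend on the targets, so only the inputs are needed here.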
Now you could ask: why does spherical vs. ellipsoidal matter?
Since gradient descent is a first-derivative method, it does not take the curvature of the surface into account when choosing a direction before taking a step. An ellipsoidal surface (curvature that differs a lot between directions) can therefore cause trouble with convergence, especially if you set a large learning rate (the algorithm takes bigger steps at each iteration): a step that is reasonable along the flat direction overshoots the minimum along the steep one, each iteration lands farther away than the last, and the cost grows until it overflows. I think it is easier to understand by looking at a 2D plot example. With a spherical surface the negative gradient points straight at the minimum, which makes learning easier.
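The effect can be reproduced with a small experiment. This is only a sketch under my own assumptions, not the code from the question: the data is synthetic, and the function name `gradient_descent`, the learning rate of 0.1, and the 200 iterations are arbitrary choices. The same learning rate is used on the raw and on the standardized inputs.

```python
import numpy as np

def gradient_descent(X, y, lr, n_iter):
    """Batch gradient descent on J(w) = ||A w - y||^2 / (2m), with A = [1, X]."""
    m = len(y)
    A = np.column_stack([np.ones(m), X])
    w = np.zeros(A.shape[1])
    costs = []
    for _ in range(n_iter):
        residual = A @ w - y
        costs.append(residual @ residual / (2 * m))
        w -= lr * (A.T @ residual) / m
    return w, np.array(costs)

# Synthetic data (an assumption for illustration): two features on very
# different scales and a noisy linear target.
rng = np.random.default_rng(0)
m = 200
X_raw = np.column_stack([rng.uniform(500.0, 3500.0, m),
                         rng.uniform(1.0, 6.0, m)])
y = 50.0 + 0.1 * X_raw[:, 0] + 20.0 * X_raw[:, 1] + rng.normal(0.0, 5.0, m)

# Standardize: zero mean and unit variance per feature.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Same learning rate for both runs; silence the expected overflow warnings.
with np.errstate(over="ignore", invalid="ignore"):
    _, costs_raw = gradient_descent(X_raw, y, lr=0.1, n_iter=200)
_, costs_std = gradient_descent(X_std, y, lr=0.1, n_iter=200)

print("raw features, final cost:         ", costs_raw[-1])
print("standardized features, final cost:", costs_std[-1])
```

On the raw inputs the cost stops being finite (it ends up as `inf` or `nan`), while on the standardized inputs the same learning rate converges. For a quadratic cost like this one, gradient descent is only stable when the learning rate is below 2 divided by the largest eigenvalue of the Hessian, and unscaled features make that eigenvalue huge.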