I am doing linear regression with multiple features. I decided to use the normal equation method to find the coefficients of the linear model. If we use gradient descent for linear regression with multiple variables, we typically do feature scaling to speed up gradient descent convergence. For now, I am going to use the normal equation formula: theta = (X^T X)^(-1) X^T y.
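To make it concrete, here is roughly what I plan to do (toy data and NumPy purely for illustration, not my real dataset):

```python
import numpy as np

# Toy data just for illustration: m samples, 2 features plus an intercept column.
rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(50, 400, m), rng.uniform(1, 8, m)])
y = rng.normal(size=m)

# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
```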
I have two contradictory sources of information. The first states that no feature scaling is required for the normal equation. In the other, I can see that feature normalization has to be done. Sources:
http://puriney.github.io/numb/2013/07/06/normal-equations-gradient-descent-and-linear-regression/
At the end of these two articles, information concerning feature scaling with normal equations is presented.
The question is: do we need to do feature scaling before applying the normal equation?
You may indeed not need to scale your features, and from a theoretical point of view you get the solution in just one "step". In practice, however, things might be a bit different.
Notice the matrix inversion in your formula. Inverting a matrix is not a trivial computational operation. In fact, there's a measure of how hard it is to invert a matrix (and perform some other computations), called the condition number:
If the condition number is not too much larger than one (but it can still be a multiple of one), the matrix is well conditioned which means its inverse can be computed with good accuracy. If the condition number is very large, then the matrix is said to be ill-conditioned. Practically, such a matrix is almost singular, and the computation of its inverse, or solution of a linear system of equations is prone to large numerical errors. A matrix that is not invertible has condition number equal to infinity.
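To see what this means for your case, here is a rough sketch (made-up NumPy data with features on very different scales, e.g. something like house size and number of rooms, chosen purely for illustration) comparing the condition number of X^T X before and after standardizing the features:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
size_m2 = rng.uniform(50, 400, m)            # feature on a large scale
rooms = rng.integers(1, 8, m).astype(float)  # feature on a small scale
X = np.column_stack([np.ones(m), size_m2, rooms])  # design matrix with intercept

# Condition number of X^T X with the raw features
print(np.linalg.cond(X.T @ X))

# Standardize the non-intercept columns (zero mean, unit variance) and compare
Xs = X.copy()
Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
print(np.linalg.cond(Xs.T @ Xs))
```

The exact numbers depend on the data, but with features on such different scales the condition number typically drops by several orders of magnitude after scaling.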
P.S. A large condition number is actually the same problem that slows down gradient descent's convergence.
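If you want to see that last remark in action, here is a rough sketch (the same kind of made-up data as above, redefined so the snippet runs on its own) counting batch gradient descent iterations with and without scaling; the step size 1/L is just a standard safe choice for least squares, not anything specific to your setup:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
size_m2 = rng.uniform(50, 400, m)            # feature on a large scale
rooms = rng.integers(1, 8, m).astype(float)  # feature on a small scale
X = np.column_stack([np.ones(m), size_m2, rooms])
y = 3.0 + 0.02 * size_m2 + 0.5 * rooms + rng.normal(0.0, 1.0, m)

# Standardized copy of the non-intercept columns
Xs = X.copy()
Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

def gd_iterations(A, y, tol=1e-6, max_iter=200_000):
    """Batch gradient descent for least squares with step size 1/L,
    where L is the largest eigenvalue of A^T A / m (a safe step size).
    Returns how many iterations it takes for the gradient norm to drop below tol."""
    m = len(y)
    L = np.linalg.eigvalsh(A.T @ A / m).max()
    theta = np.zeros(A.shape[1])
    for i in range(max_iter):
        grad = A.T @ (A @ theta - y) / m  # gradient of (1/2m) * ||A theta - y||^2
        if np.linalg.norm(grad) < tol:
            return i
        theta -= grad / L
    return max_iter  # did not reach the tolerance within the budget

print(gd_iterations(X, y))   # raw features: ill-conditioned, may not even converge within the budget
print(gd_iterations(Xs, y))  # standardized features: converges in comparatively few iterations
```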