
Least squares methods: normal equation vs. SVD


I tried to write my own code for linear regression, following the normal equation beta = inv(X'X)X'Y. However, the squared error is much larger than what I get from numpy.linalg.lstsq. Could anybody explain why the SVD method (which lstsq uses) is more accurate than the normal equation? Thank you


Solution

  • I suspect the matrix X'X for your data has a high condition number. Trying to compute a numerical inverse of such a matrix can result in large errors. It is usually a bad idea to explicitly compute an inverse matrix (see, for example, http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/ or http://epubs.siam.org/doi/abs/10.1137/1.9780898718027.ch14).

    You can check the condition number using numpy.linalg.cond.

    Here's an example. First create X and Y:

    In [186]: X = np.random.randn(500, 30)
    
    In [187]: Y = np.linspace(0, 1, len(X))
    

    For this random X, the condition number is not large:

    In [188]: np.linalg.cond(X.T.dot(X))
    Out[188]: 2.4456380658308148
    

    The normal equation and lstsq give the same result (according to numpy.allclose when using that function's default arguments):

    In [189]: betan = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
    
    In [190]: betal, res, rnk, s = np.linalg.lstsq(X, Y)
    
    In [191]: np.allclose(betan, betal)
    Out[191]: True
    

    Now tweak X by making two columns almost the same. This makes X'X almost singular, and gives it a large condition number:

    In [192]: X[:,0] = X[:,1] + 1e-8*np.random.randn(len(X))
    
    In [193]: np.linalg.cond(X.T.dot(X))
    Out[193]: 3954529794300611.5
    

    Now the normal equation gives a different result than lstsq:

    In [194]: betan = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
    
    In [195]: betal, res, rnk, s = np.linalg.lstsq(X, Y)
    
    In [196]: np.allclose(betan, betal)
    Out[196]: False
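
    To see why this matters for the squared error the question asks about, here is a minimal sketch (not part of the original answer) that rebuilds a similarly ill-conditioned X and compares the sum of squared residuals for several solvers: the explicit inverse, np.linalg.solve on the normal equations, and the SVD-based np.linalg.lstsq and np.linalg.pinv. The seed, variable names, and the rcond=None argument (which just selects NumPy's default singular-value cutoff) are choices made for this sketch; the exact numbers depend on the random data, but when cond(X'X) is this large the SVD-based solutions should give the smallest residual.

    import numpy as np

    # Rebuild a setup like the one above: two nearly identical columns
    # make X'X close to singular.
    np.random.seed(0)  # fixed seed only so the sketch is reproducible
    X = np.random.randn(500, 30)
    X[:, 0] = X[:, 1] + 1e-8 * np.random.randn(len(X))
    Y = np.linspace(0, 1, len(X))

    # Normal equation with an explicit inverse: cond(X'X) ~ cond(X)**2,
    # so the inversion amplifies rounding errors.
    beta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

    # Solving the normal equations without forming the inverse is better
    # practice, but it still has to work with the squared condition number.
    beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

    # SVD-based solvers factor X itself, so they only pay cond(X):
    # lstsq is what the question compares against; pinv is the SVD pseudoinverse.
    beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
    beta_pinv = np.linalg.pinv(X) @ Y

    # Compare the sums of squared residuals ||X @ beta - Y||**2.
    for name, beta in [("inv", beta_inv), ("solve", beta_solve),
                       ("lstsq", beta_lstsq), ("pinv", beta_pinv)]:
        print(name, np.sum((X @ beta - Y) ** 2))

    In short, lstsq and pinv work with the singular values of X directly, so they avoid squaring the condition number; that is why their squared error stays close to the true least-squares minimum while the explicit-inverse result can be far off.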