I tried to write my own code for linear regression, following the normal equation beta = inv(X'X)X'Y. However, the squared error is much bigger than that of the lstsq function in numpy.linalg. Could anybody explain why the SVD method (which lstsq uses) is more accurate than the normal equation? Thank you.
I suspect the matrix X'X for your data has a high condition number. Trying to compute a numerical inverse of such a matrix can result in large errors. It is usually a bad idea to explicitly compute an inverse matrix (see, for example, http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/ or http://epubs.siam.org/doi/abs/10.1137/1.9780898718027.ch14).
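Even if you stick with the normal equations, it is better to solve the linear system than to form the inverse. Here's a minimal sketch using numpy.linalg.solve, assuming X and Y are your design matrix and target vector:

# Solve (X'X) beta = X'Y via a factorization instead of forming
# inv(X'X) explicitly; this is both faster and numerically safer.
beta = np.linalg.solve(X.T.dot(X), X.T.dot(Y))

This still degrades when X'X is ill-conditioned, just less catastrophically than an explicit inverse; the SVD approach demonstrated below is the robust option.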
You can check the condition number using numpy.linalg.cond.
Here's an example. First create X and Y:
In [186]: X = np.random.randn(500, 30)
In [187]: Y = np.linspace(0, 1, len(X))
For this random X, the condition number is not large:
In [188]: np.linalg.cond(X.T.dot(X))
Out[188]: 2.4456380658308148
The normal equation and lstsq give the same result (according to numpy.allclose when using that function's default arguments):
In [189]: betan = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
In [190]: betal, res, rnk, s = np.linalg.lstsq(X, Y)
In [191]: np.allclose(betan, betal)
Out[191]: True
Now tweak X by making two columns almost the same. This makes X'X almost singular, and gives it a large condition number:
In [192]: X[:,0] = X[:,1] + 1e-8*np.random.randn(len(X))
In [193]: np.linalg.cond(X.T.dot(X))
Out[193]: 3954529794300611.5
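A rule of thumb: the relative error in a computed solution can be amplified by roughly the condition number times the machine epsilon (about 2.2e-16 for double precision). Here that product is of order 1, which means essentially no digits of the computed inverse can be trusted:

# kappa * eps estimates the relative error amplification; a value
# near or above 1 means the result may have no correct digits at all.
kappa = np.linalg.cond(X.T.dot(X))
print(kappa * np.finfo(float).eps)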
Now the normal equation gives a different result than lstsq:
In [194]: betan = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
In [195]: betal, res, rnk, s = np.linalg.lstsq(X, Y)
In [196]: np.allclose(betan, betal)
Out[196]: False
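To connect this back to the original question, you can compare the residual sum of squares of the two solutions, and also check an explicit SVD-based solve via numpy.linalg.pinv (which, like lstsq, computes an SVD and drops singular values below a small cutoff). A sketch continuing from the session above; exact values depend on the random data:

# Residual sum of squares: with X'X nearly singular, the lstsq
# solution's residual should be no larger than the normal equation's,
# and is typically much smaller.
rss_normal = np.sum((X.dot(betan) - Y)**2)
rss_lstsq = np.sum((X.dot(betal) - Y)**2)
print(rss_normal, rss_lstsq)

# pinv computes the Moore-Penrose pseudoinverse via the SVD, so its
# solution should stay close to the lstsq result even for this
# ill-conditioned problem.
betap = np.linalg.pinv(X).dot(Y)
print(np.allclose(betap, betal))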