
Least squares methods: normal equation vs. SVD


I tried to write my own code for linear regression, following the normal equation beta = inv(X'X)X'Y. However, the squared error is much larger than what I get from numpy.linalg.lstsq. Could anybody explain why the SVD method (which lstsq uses) is more accurate than the normal equation? Thank you


Solution

  • I suspect the matrix X'X for your data has a high condition number. Trying to compute a numerical inverse of such a matrix can result in large errors. It is usually a bad idea to explicitly compute an inverse matrix (see, for example, http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/ or http://epubs.siam.org/doi/abs/10.1137/1.9780898718027.ch14).

    You can check the condition number using numpy.linalg.cond.

    Here's an example. First create X and Y:

    In [186]: X = np.random.randn(500, 30)
    
    In [187]: Y = np.linspace(0, 1, len(X))
    

    For this random X, the condition number is not large:

    In [188]: np.linalg.cond(X.T.dot(X))
    Out[188]: 2.4456380658308148
    

    The normal equation and lstsq give the same result (according to numpy.allclose when using that function's default arguments):

    In [189]: betan = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
    
    In [190]: betal, res, rnk, s = np.linalg.lstsq(X, Y)
    
    In [191]: np.allclose(betan, betal)
    Out[191]: True
    

    Now tweak X by making two columns almost the same. This makes X'X almost singular, and gives it a large condition number:

    In [192]: X[:,0] = X[:,1] + 1e-8*np.random.randn(len(X))
    
    In [193]: np.linalg.cond(X.T.dot(X))
    Out[193]: 3954529794300611.5
    

    Now the normal equation gives a different result than lstsq:

    In [194]: betan = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
    
    In [195]: betal, res, rnk, s = np.linalg.lstsq(X, Y)
    
    In [196]: np.allclose(betan, betal)
    Out[196]: False
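
    To see why this matters for the squared error the question asks about, here is a minimal sketch (not part of the original answer) that rebuilds a similarly ill-conditioned X and compares the sum of squared residuals for several solvers: the explicit inverse, np.linalg.solve on the normal equations, and the SVD-based np.linalg.lstsq and np.linalg.pinv. The seed, variable names, and the rcond=None argument (which just selects NumPy's default singular-value cutoff) are choices made for this sketch; the exact numbers depend on the random data, but when cond(X'X) is this large the SVD-based solutions should give the smallest residual.

    import numpy as np

    # Rebuild a setup like the one above: two nearly identical columns
    # make X'X close to singular.
    np.random.seed(0)  # fixed seed only so the sketch is reproducible
    X = np.random.randn(500, 30)
    X[:, 0] = X[:, 1] + 1e-8 * np.random.randn(len(X))
    Y = np.linspace(0, 1, len(X))

    # Normal equation with an explicit inverse: cond(X'X) ~ cond(X)**2,
    # so the inversion amplifies rounding errors.
    beta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

    # Solving the normal equations without forming the inverse is better
    # practice, but it still has to work with the squared condition number.
    beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

    # SVD-based solvers factor X itself, so they only pay cond(X):
    # lstsq is what the question compares against; pinv is the SVD pseudoinverse.
    beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
    beta_pinv = np.linalg.pinv(X) @ Y

    # Compare the sums of squared residuals ||X @ beta - Y||**2.
    for name, beta in [("inv", beta_inv), ("solve", beta_solve),
                       ("lstsq", beta_lstsq), ("pinv", beta_pinv)]:
        print(name, np.sum((X @ beta - Y) ** 2))

    In short, lstsq and pinv work with the singular values of X directly, so they avoid squaring the condition number; that is why their squared error stays close to the true least-squares minimum while the explicit-inverse result can be far off.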