
No Residuals With Numpy's Least Squares


I am trying to solve a least squares problem in Numpy (i.e. Ordinary Least Squares (OLS) with simple regression) in order to find the corresponding R² value. However, in some cases Numpy returns an empty array for the residuals. Take the following over-determined example (i.e. more equations than unknowns), which illustrates the problem:

[Figure: OLS problem]

(Note: There is no constant term, i.e. no intercept column of all 1's, so the Uncentered Total Sum of Squares (TSS) is used.)

import numpy as np

A = np.array([[6, 6, 3], [40, 40, 20]]).T
y = np.array([0.5, 0.2, 0.6])

model_parameters, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

# No Intercept, therefore use Uncentered Total Sum of Squares (TSS)
uncentered_tss = np.sum((y)**2)  
numpy_r2 = 1.0 - residuals / uncentered_tss

print("Numpy Model Parameter(s): " + str(model_parameters))
print("Numpy Sum of Squared Residuals (SSR): " + str(residuals))
print("Numpy R²: " + str(numpy_r2))

This produces the following output:

Numpy Model Parameter(s): [0.00162999 0.01086661]
Numpy Sum of Squared Residuals (SSR): []
Numpy R²: []

According to the numpy documentation:

... residuals will be empty when the equations are under-determined or well-determined but return values when they are over-determined.

However, this problem is clearly over-determined (3 equations vs. 2 unknowns). I can even show that the residuals (and thereby the sum of squared residuals (SSR)) exist by fitting the same regression with statsmodels' OLS class:

import numpy as np
import statsmodels.api as sm

A = np.array([[6, 6, 3], [40, 40, 20]]).T
y = np.array([0.5, 0.2, 0.6])

statsmodels_model = sm.OLS(y, A)
regression_results = statsmodels_model.fit()

calculated_r_squared = 1.0 - regression_results.ssr / np.sum((y)**2)

print("Parameters: " + str(regression_results.params))
print("Residuals: " + str(regression_results.resid))
print("Statsmodels R²: " + str(regression_results.rsquared))
print("Manually Calculated R²: " + str(calculated_r_squared))

This produces the following output:

Parameters: [0.00162999 0.01086661]
Residuals: [ 0.05555556 -0.24444444  0.37777778]
Statsmodels R²: 0.6837606837606838
Manually Calculated R²: 0.6837606837606838

(As you can see, the Statsmodels and Numpy models have agreeing parameters.)

Why does Numpy return an empty SSR array for the example above? Is this a bug in numpy.linalg.lstsq? If it is not a bug, then why is Statsmodels able to compute the sum of squared residuals (SSR) when Numpy is not? One can also clearly compute the residuals by hand given the plane of best fit:

[Figure: function plane]


Solution

  • From documentation of numpy.linalg.lstsq():

    residuals : {(), (1,), (K,)} ndarray

    ... If the rank of a is < N or M <= N, this is an empty array. ...

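    The rank of your matrix is 1, which is less than N = 2: the second column, [40, 40, 20], is exactly 20/3 times the first, [6, 6, 3]. Having more rows than columns is not enough; the columns must also be linearly independent for lstsq to return the residuals.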


    NOTE: What you think are "missing" residuals can be found using numpy as well (you do not need other packages):

    residuals = y - np.dot(A, model_parameters)
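
To make this concrete, here is a minimal sketch that reuses A and y from the question, checks the rank that lstsq reports, and then computes the SSR and the uncentered R² by hand. The matrix A_full is a hypothetical full-rank variant, added only to show the contrasting case where lstsq does return the SSR.

```python
import numpy as np

A = np.array([[6, 6, 3], [40, 40, 20]]).T
y = np.array([0.5, 0.2, 0.6])

params, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print("Rank:", rank)                             # columns are proportional
print("lstsq residuals size:", residuals.size)   # empty when rank < N

# Compute the residual vector, SSR and uncentered R² manually.
resid = y - A @ params
ssr = np.sum(resid ** 2)
r2 = 1.0 - ssr / np.sum(y ** 2)
print("SSR:", ssr)
print("R²:", r2)

# Hypothetical full-rank matrix (not from the question) for contrast:
# here M = 3 > N = 2 and rank == N, so lstsq returns the SSR itself.
A_full = np.array([[6, 6, 3], [1, 2, 3]]).T
params_full, residuals_full, rank_full, _ = np.linalg.lstsq(A_full, y, rcond=None)
ssr_full = np.sum((y - A_full @ params_full) ** 2)
print("Full-rank lstsq residuals:", residuals_full, "manual SSR:", ssr_full)
```

The manually computed R² should agree with the value statsmodels reports in the question.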