Tags: python, scikit-learn, linear-regression, statsmodels

Results of sklearn/statsmodels ordinary least squares under singular covariance matrix


When computing ordinary least squares regression with either sklearn.linear_model.LinearRegression or statsmodels.regression.linear_model.OLS, neither seems to throw an error when the covariance matrix is exactly singular. It looks like under the hood they use the Moore-Penrose pseudoinverse rather than the ordinary inverse, which would not exist for a singular covariance matrix.
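
For example, both libraries fit a design with an exactly collinear column without complaint (a minimal sketch with made-up data):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=50)
    x2 = 2.0 * x1                      # exact collinearity, so X'X is singular
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

    print(LinearRegression().fit(X, y).coef_)            # no error raised
    print(sm.OLS(y, sm.add_constant(X)).fit().params)    # no error raised either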

The question is then twofold:

  1. What is the point of this design? Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?

  2. What does it output as coefficients then? To my understanding, since the covariance matrix is singular, there would be infinitely many solutions (differing along the collinear directions), so which one does the pseudoinverse return?


Solution

  • Two related questions and answers:

    Differences in Linear Regression in R and Python

    Statsmodels with partly identified model

    To 1) Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?

    Even though some parameters are not identified and are pinned down by an "arbitrary" unique solution out of the infinitely many possible solutions, some result statistics are not affected by the non-identification. The main ones are estimable linear combinations, prediction, and R-squared.

    Some linear combinations of parameters are identified even when not all parameters are identified separately. For example, we can still test whether all means of a one-way categorical variable are equal. These are estimable functions even under singularity and are the reason statsmodels inherited the pinv behavior from its precursor package. However, statsmodels does not have functions to identify estimable functions from a singular covariance matrix of the parameter estimates.
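
    As an illustration (a sketch with made-up data; the group structure and contrast matrix are chosen here for the example), a one-way design that includes both an intercept and a full set of group dummies is exactly singular, yet the F-test that all group means are equal is still well defined:

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(0)
        groups = np.repeat([0, 1, 2], 20)
        y = np.array([1.0, 1.5, 3.0])[groups] + rng.normal(scale=0.5, size=60)

        dummies = np.eye(3)[groups]            # d0 + d1 + d2 = 1, collinear with the constant
        X = sm.add_constant(dummies)           # columns: const, d0, d1, d2
        res = sm.OLS(y, X).fit()               # individual parameters are not identified

        # Differences of group means are estimable: mu0 - mu1 = b1 - b2, mu0 - mu2 = b1 - b3.
        # Testing both jointly is the test that all group means are equal.
        R = np.array([[0.0, 1.0, -1.0, 0.0],
                      [0.0, 1.0, 0.0, -1.0]])
        print(res.f_test(R))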

    We also get a unique prediction for any values of the explanatory variables, which remains useful as long as the new observations satisfy the same perfect collinearity.

    Some summary and inferential statistics, such as R-squared, are independent of the way the unique parameters are chosen. This is sometimes convenient and is used, for example, in diagnostics and specification tests where an LM test can be computed from the R-squared.
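
    Both points can be illustrated by fitting the same data once with the full singular design (pinv solution) and once with the redundant column dropped (a sketch with made-up data): the coefficients differ, but fitted values, predictions that respect the collinearity, and R-squared all agree.

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(1)
        x1 = rng.normal(size=40)
        x2 = x1 - 2.0                               # exact function of x1 and the constant
        y = 1.0 + 2.0 * x1 + rng.normal(scale=0.3, size=40)

        res_full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # singular design
        res_red = sm.OLS(y, sm.add_constant(x1)).fit()                          # redundant column dropped

        print(res_full.params, res_red.params)                            # different coefficients
        print(np.allclose(res_full.fittedvalues, res_red.fittedvalues))   # True
        print(np.isclose(res_full.rsquared, res_red.rsquared))            # True

        # Predictions at new points agree as well, provided the new rows keep the
        # same collinearity (x2_new = x1_new - 2).
        x1_new = np.array([0.5, 1.5])
        Xn_full = sm.add_constant(np.column_stack([x1_new, x1_new - 2.0]), has_constant='add')
        Xn_red = sm.add_constant(x1_new, has_constant='add')
        print(np.allclose(res_full.predict(Xn_full), res_red.predict(Xn_red)))  # True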

    To 2) What does it output as coefficients then?

    The parameters estimated by the Moore-Penrose inverse can be interpreted as symmetrically penalized or regularized estimates. The Moore-Penrose solution is also the limit of ridge regression as the penalization weight goes to zero. (I don't remember where I read this.)
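
    This limit can be checked numerically (a sketch with made-up data, using sklearn's Ridge for the penalized fit): with a duplicated column, the minimum-norm pinv solution splits the coefficient symmetrically, and the ridge solution converges to it as the penalty shrinks.

        import numpy as np
        from sklearn.linear_model import Ridge

        rng = np.random.default_rng(2)
        x1 = rng.normal(size=100)
        X = np.column_stack([x1, x1])                # duplicated column, X'X singular
        y = 4.0 * x1 + rng.normal(scale=0.2, size=100)

        beta_pinv = np.linalg.pinv(X) @ y            # minimum-norm solution, roughly [2, 2]

        for alpha in [1.0, 1e-3, 1e-8]:
            ridge = Ridge(alpha=alpha, fit_intercept=False, solver="svd").fit(X, y)
            print(alpha, ridge.coef_, beta_pinv)     # ridge.coef_ approaches beta_pinv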

    Also, in some cases with a singular design, the indeterminacy affects only some of the parameters. Even though we have to be careful about what we infer for those parameters, the other parameters might still be identified and unaffected by the perfectly collinear part.

    A software package has essentially three options for handling singular cases:

    • raise an exception and refuse to compute anything
    • drop some variables (the question then being which variables to drop)
    • switch to a penalized solution, which includes the generalized inverse

    statsmodels picks option 3, mainly because of the symmetric treatment of variables. R and Stata pick option 2 in many models (where, I think, it is difficult to predict which variable will be dropped).

    One reason for the symmetric treatment is that it makes it easier to compare the same regression across many datasets, which becomes more difficult under option 2 if the dropped variable is not always the same.