python, scipy, scikit-learn, statsmodels, pymc

Quickest linear regression implementation in python


I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.

In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.
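For context, the stepwise VIF-dropping loop described above can be sketched roughly like this (the toy data and the 10.0 threshold are illustrative assumptions; `variance_inflation_factor` is provided by statsmodels):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy data: x2 is nearly collinear with x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),  # near-duplicate of x1
    "x3": rng.normal(size=200),
})

def vif_series(df):
    # VIF for each column; values above ~5-10 usually flag collinearity.
    return pd.Series(
        [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
        index=df.columns,
    )

# Repeatedly drop the highest-VIF column until all VIFs fall below threshold.
threshold = 10.0  # illustrative cutoff
vifs = vif_series(X)
while vifs.max() > threshold:
    X = X.drop(columns=vifs.idxmax())
    vifs = vif_series(X)

print(list(X.columns))  # one of the collinear pair is gone, x3 survives
```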

What would be the quickest OLS implementation for larger datasets? The statsmodels OLS implementation seems to use numpy to invert matrices. Would a gradient-descent-based method be quicker? Does scikit-learn have an especially fast implementation?

Or maybe an MCMC-based approach using pymc would be quickest...

Update 1: It seems that the scikit-learn LinearRegression implementation is a wrapper around the scipy implementation.

Update 2: Scipy OLS via scikit-learn's LinearRegression was twice as fast as statsmodels OLS in my very limited tests...


Solution

  • The scikit-learn SGDRegressor class is (iirc) the fastest, but it would probably be harder to tune than a simple LinearRegression.

    I would give each of those a try and see if they meet your needs. I also recommend subsampling your data - if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space without wasting a bunch of time on "repeat/uninteresting" data.

    Once you find a few candidate models, then you can try those on the whole dataset.
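A minimal sketch of the SGDRegressor-plus-subsampling workflow suggested above (synthetic data; the sample size and hyperparameters are illustrative assumptions - SGD is sensitive to feature scaling, hence the StandardScaler):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "large" dataset standing in for the real one.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))
y = X @ rng.normal(size=20) + rng.normal(size=50_000)

# Tune on a random subsample for speed...
idx = rng.choice(len(X), size=5_000, replace=False)
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-4))
model.fit(X[idx], y[idx])

# ...then check candidate models against the full dataset.
r2 = model.score(X, y)
print(f"R^2 on full data: {r2:.3f}")
```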