Tags: python, scikit-learn, linear-regression, statsmodels

OLS Regression: Scikit vs. Statsmodels?


Short version: I was using the scikit LinearRegression on some data, but I'm used to p-values so put the data into the statsmodels OLS, and although the R^2 is about the same the variable coefficients are all different by large amounts. This concerns me since the most likely problem is that I've made an error somewhere and now I don't feel confident in either output (since likely I have made one model incorrectly but don't know which one).

Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.

I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.

For scikit's LR, the results are (statistically) the same whether I set normalize=True or normalize=False, which I find somewhat strange.

For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away but the results are exactly the same.

Granted, I'm using 5-fold cv for the sklearn approach (R^2 is consistent for both test and training data each time), and for statsmodels I just throw all the data at it.

R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.

The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.

Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.

The dependent variable is how many levels each character gained during that week (int).

Interestingly, some of the relative order within like variables is maintained across statsmodels and sklearn. So, rank order of "when seen" is the same although the loadings are very different, and rank order for the character class dummies is the same although again the loadings are very different.

I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.

I would love to know:

  1. Which output might be accurate? (Granted they might both be if I missed a kwarg.)
  2. If I made a mistake, what is it and how to fix it?
  3. Could I have figured this out without asking here, and if so how?

I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.

(I even tried some other OLS calls to triangulate: one gave a much lower R^2, one looped for five minutes until I killed it, and one crashed.)

Thanks!


Solution

  • It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression
    
    # Generate artificial data (2 regressors + constant)
    nobs = 100 
    X = np.random.random((nobs, 2)) 
    X = sm.add_constant(X)
    beta = [1, .1, .5] 
    e = np.random.random(nobs)
    y = np.dot(X, beta) + e 
    
    # Fit regression model
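    # Fit the same design matrix X with both libraries.
    # No seed is set, so the exact numbers differ from run to run.
    print(sm.OLS(y, X).fit().params)
    # e.g. array([ 1.4507724 ,  0.08612654,  0.60129898])
    
    # X already holds the constant column, so turn off sklearn's own intercept
    print(LinearRegression(fit_intercept=False).fit(X, y).coef_)
    # e.g. array([ 1.4507724 ,  0.08612654,  0.60129898])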
    

    As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and statsmodels and sklearn could be taking different actions under the hood to make the OLS computation go through (for example, dropping different columns).
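
    A quick way to check for this is to compare the rank of X with its number of columns. The sketch below builds a deliberately redundant matrix (a constant plus a dummy for every one of 8 classes) on synthetic data; the names X_trap and X_ok are illustrative and not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(1)
nobs, n_classes = 50, 8

# Synthetic class labels; force every class to appear at least once
labels = rng.integers(0, n_classes, size=nobs)
labels[:n_classes] = np.arange(n_classes)

dummies_all = np.eye(n_classes)[labels]   # one indicator column per class
const = np.ones((nobs, 1))

# Constant + all 8 dummies: the dummies sum to the constant column
X_trap = np.hstack([const, dummies_all])
# Constant + 7 dummies (one class dropped as the reference level)
X_ok = np.hstack([const, dummies_all[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("redundant:", rank_trap, "of", X_trap.shape[1], "columns")
print("full rank:", rank_ok, "of", X_ok.shape[1], "columns")
```

    When the rank is smaller than the number of columns, the coefficients are not uniquely determined, so two correct solvers can report different numbers with the same R^2.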

    I recommend you use pandas and patsy to take care of this:

    import pandas as pd
    from patsy import dmatrices
    
    dat = pd.read_csv('wow.csv')
    y, X = dmatrices('levels ~ week + character + guild', data=dat)
    

    Or, alternatively, the statsmodels formula interface:

    import pandas as pd
    import statsmodels.formula.api as smf
    dat = pd.read_csv('wow.csv')
    mod = smf.ols('levels ~ week + character + guild', data=dat).fit()
    

    Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html