
How to output Regression Analysis summary from polynomial regression with scikit-learn?


I currently have the following code, which does a polynomial regression on a dataset with 4 variables:

from numpy import genfromtxt, savetxt
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

def polyreg():
    # Skip the header row; the first column is the target, the rest are features
    dataset = genfromtxt(open('train.csv', 'r'), delimiter=',', dtype='f8')[1:]
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    test = genfromtxt(open('test.csv', 'r'), delimiter=',', dtype='f8')[1:]

    # Expand the features with all polynomial terms up to degree 2
    poly = PolynomialFeatures(degree=2)
    train_poly = poly.fit_transform(train)
    test_poly = poly.transform(test)  # reuse the expansion fitted on train

    clf = linear_model.LinearRegression()
    clf.fit(train_poly, target)

    savetxt('polyreg_test1.csv', clf.predict(test_poly), delimiter=',', fmt='%f')

I wanted to know if there is a way to output a summary of the regression, like the Regression Analysis output in Excel. I explored the attributes and methods of linear_model.LinearRegression() but couldn't find anything.



Solution

  • This is not implemented in scikit-learn; the scikit-learn ecosystem is quite biased towards using cross-validation for model evaluation (a good thing in my opinion; most of these test statistics were developed out of necessity before computers were powerful enough for cross-validation to be feasible).

    For more traditional types of statistical analysis you can use statsmodels; here is an example taken from its documentation:

    import numpy as np
    import statsmodels.api as sm

    # Simulate y = 1 + 0.1*x + 10*x**2 + Gaussian noise
    nsample = 100
    x = np.linspace(0, 10, nsample)
    X = np.column_stack((x, x**2))
    beta = np.array([1, 0.1, 10])
    e = np.random.normal(size=nsample)

    X = sm.add_constant(X)  # prepend an intercept column
    y = np.dot(X, beta) + e

    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       1.000
    Model:                            OLS   Adj. R-squared:                  1.000
    Method:                 Least Squares   F-statistic:                 4.020e+06
    Date:                Sun, 01 Feb 2015   Prob (F-statistic):          2.83e-239
    Time:                        09:32:32   Log-Likelihood:                -146.51
    No. Observations:                 100   AIC:                             299.0
    Df Residuals:                      97   BIC:                             306.8
    Df Model:                           2
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    const          1.3423      0.313      4.292      0.000         0.722     1.963
    x1            -0.0402      0.145     -0.278      0.781        -0.327     0.247
    x2            10.0103      0.014    715.745      0.000         9.982    10.038
    ==============================================================================
    Omnibus:                        2.042   Durbin-Watson:                   2.274
    Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
    Skew:                           0.234   Prob(JB):                        0.392
    Kurtosis:                       2.519   Cond. No.                         144.
    ==============================================================================
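    To get the same kind of summary for the polynomial regression in the question, you can feed the `PolynomialFeatures` output straight into `sm.OLS`. A minimal sketch (synthetic data stands in for `train.csv`, since the actual file isn't shown; note that `PolynomialFeatures` already adds the constant column, so `sm.add_constant` is not needed):

    ```python
    import numpy as np
    import statsmodels.api as sm
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic stand-in for train.csv: 100 rows, 4 features, one target
    rng = np.random.RandomState(0)
    train = rng.uniform(size=(100, 4))
    target = train @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

    # Same expansion as in the question; include_bias=True (the default)
    # adds the column of ones that serves as the intercept
    poly = PolynomialFeatures(degree=2)
    train_poly = poly.fit_transform(train)

    # OLS over the expanded design matrix gives the Excel-style summary
    results = sm.OLS(target, train_poly).fit()
    print(results.summary())
    ```

    With 4 original features, the degree-2 expansion has 15 columns (1 constant + 4 linear + 10 quadratic terms), so the summary reports 15 coefficients with their standard errors, t-statistics, and p-values.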
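    On the cross-validation point above: the idiomatic scikit-learn way to judge such a model is `cross_val_score` with a `Pipeline`, so the polynomial expansion is refit inside each fold. A hedged sketch with synthetic data (in modern scikit-learn; the `python-2.7`-era releases kept these helpers in `sklearn.cross_validation` instead of `sklearn.model_selection`):

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic data with a genuine quadratic term
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(100, 4))
    y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

    # The pipeline keeps the expansion inside each fold, so the
    # held-out fold never influences the fitted model
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(scores.mean())
    ```

    This gives an out-of-sample R² rather than the in-sample one printed in the statsmodels summary, which is the trade-off the answer alludes to.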