
Repeated columns of a single variable when using statsmodels.formula.api package ols function in python


I am trying to perform multiple linear regression using the statsmodels.formula.api package in Python. The code I used is listed below.

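import pandas as pd
import statsmodels.formula.api as smf

auto_1 = pd.read_csv("Auto.csv")
# use every column except the first (mpg) and the last (name) as predictors
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())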

The data consists of the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin and name. When the summary is printed, it shows multiple rows for the horsepower column, and the regression results are not correct either. I'm not sure why.

screenshot of repeated rows


Solution

  • This is most likely caused by the data type of the horsepower column. If its values are strings or categories, the model applies treatment (dummy) coding to them by default, which produces one row per distinct value, as in your output. Check the data types with auto_1.dtypes and cast the column to a numeric type. Ideally, do this when you first read the CSV file, using the dtype= parameter of read_csv(). Note that the cast only succeeds if every value in the column can be parsed as a number.

    Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    
    df = pd.DataFrame(
        {
            'mpg': np.random.randint(20, 40, 50),
            'horsepower': np.random.randint(100, 200, 50)
        }
    )
    # convert integers to strings (or categories)
    df['horsepower'] = (
        df['horsepower'].astype('str')  # same result with .astype('category')
    )
    
    formula = 'mpg ~ horsepower'
    
    results = smf.ols(formula, df).fit()
    print(results.summary())
    

    Output (dummy coding):

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                    mpg   R-squared:                       0.778
    Model:                            OLS   Adj. R-squared:                 -0.207
    Method:                 Least Squares   F-statistic:                    0.7901
    Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.715
    Time:                        20:17:51   Log-Likelihood:                -110.27
    No. Observations:                  50   AIC:                             302.5
    Df Residuals:                       9   BIC:                             380.9
    Df Model:                          40                                         
    Covariance Type:            nonrobust                                         
    =====================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
    -------------------------------------------------------------------------------------
    Intercept            32.0000      5.175      6.184      0.000      20.294      43.706
    horsepower[T.103]    -4.0000      7.318     -0.547      0.598     -20.555      12.555
    horsepower[T.112]    -1.0000      7.318     -0.137      0.894     -17.555      15.555
    horsepower[T.116]    -9.0000      7.318     -1.230      0.250     -25.555       7.555
    horsepower[T.117]     6.0000      7.318      0.820      0.433     -10.555      22.555
    horsepower[T.118]     2.0000      7.318      0.273      0.791     -14.555      18.555
    horsepower[T.120]    -4.0000      6.338     -0.631      0.544     -18.337      10.337
    
    etc.
    

    Now, converting the strings back to integers:

    df['horsepower'] = pd.to_numeric(df.horsepower)
    # or df['horsepower'] = df['horsepower'].astype('int')
    
    results = smf.ols(formula, df).fit()
    print(results.summary())
    

    Output (as expected):

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                    mpg   R-squared:                       0.011
    Model:                            OLS   Adj. R-squared:                 -0.010
    Method:                 Least Squares   F-statistic:                    0.5388
    Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.466
    Time:                        20:24:54   Log-Likelihood:                -147.65
    No. Observations:                  50   AIC:                             299.3
    Df Residuals:                      48   BIC:                             303.1
    Df Model:                           1                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     31.7638      3.663      8.671      0.000      24.398      39.129
    horsepower    -0.0176      0.024     -0.734      0.466      -0.066       0.031
    ==============================================================================
    Omnibus:                        3.529   Durbin-Watson:                   1.859
    Prob(Omnibus):                  0.171   Jarque-Bera (JB):                1.725
    Skew:                           0.068   Prob(JB):                        0.422
    Kurtosis:                       2.100   Cond. No.                         834.
    ==============================================================================
    
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
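If the column cannot be cast directly, it probably contains at least one entry that is not a number, for example a placeholder for a missing value. The sketch below assumes that placeholder is "?" (an assumption about your file, so check it first) and uses a small made-up CSV in place of Auto.csv. It shows two ways to end up with a numeric column: declaring the placeholder with na_values= while reading, or coercing afterwards with pd.to_numeric(..., errors="coerce").

```python
import io

import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for Auto.csv; the "?" placeholder is an assumption.
csv_text = """mpg,horsepower,weight
18.0,130,3504
15.0,165,3693
25.0,?,2046
24.0,95,2372
22.0,95,2833
27.0,88,2130
"""

# Read as-is: the "?" forces horsepower to be read as strings.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Option 1: tell read_csv that "?" means "missing".
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Option 2: coerce after reading; entries that cannot be parsed become NaN.
coerced = pd.to_numeric(raw["horsepower"], errors="coerce")

# With a numeric column, horsepower gets a single coefficient.
# Rows with a missing horsepower are dropped by the formula interface.
results = smf.ols("mpg ~ horsepower + weight", data=clean).fit()
print(results.params.index.tolist())
```

Either option leaves NaN where the placeholder was, so it is worth checking how many rows are affected (for example with clean["horsepower"].isna().sum()) before interpreting the fit.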