I am trying to perform multiple linear regression using the statsmodels.formula.api package in Python. The code I used is listed below.
import pandas as pd
import statsmodels.formula.api as smf

auto_1 = pd.read_csv("Auto.csv")
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())
The data consists of the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin and name. When the summary is printed, it shows multiple rows for the horsepower column, and the regression results are not correct either. I'm not sure why.
It's likely because of the data type of the horsepower column. If its values are categories or just strings, the model will use treatment (dummy) coding for them by default, producing the results you are seeing. Check the data type (run auto_1.dtypes) and cast the column to a numeric type (it's best to do it when you first read the CSV file, with the dtype= parameter of the read_csv() function).
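A minimal sketch of that check, using a small in-memory CSV as a stand-in for Auto.csv. The sample rows and the "?" placeholder are assumptions for illustration; a stray non-numeric token like that is one common reason a numeric column ends up being read as strings, so check what your file actually contains.

```python
import io
import pandas as pd

# Hypothetical stand-in for Auto.csv: one horsepower entry is a "?" placeholder.
csv_text = """mpg,cylinders,horsepower
18.0,8,130
15.0,8,165
25.0,4,?
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)  # horsepower is read as strings, not as a number

# Option 1: tell read_csv which token means "missing" so the column parses as numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean.dtypes)

# Option 2: convert after reading; entries that cannot be parsed become NaN.
converted = raw.copy()
converted["horsepower"] = pd.to_numeric(converted["horsepower"], errors="coerce")
print(converted.dtypes)
```

Either way, rows with a missing horsepower value remain in the frame as NaN, so decide separately whether to drop or impute them before fitting.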
Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.DataFrame({
    'mpg': np.random.randint(20, 40, 50),
    'horsepower': np.random.randint(100, 200, 50)
})
# convert integers to strings (or categories)
df['horsepower'] = (
    df['horsepower'].astype('str')  # same result with .astype('category')
)
formula = 'mpg ~ horsepower'
results = smf.ols(formula, df).fit()
print(results.summary())
Output (dummy coding):
                            OLS Regression Results
Dep. Variable:                    mpg   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                 -0.207
Method:                 Least Squares   F-statistic:                    0.7901
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.715
Time:                        20:17:51   Log-Likelihood:                -110.27
No. Observations:                  50   AIC:                             302.5
Df Residuals:                       9   BIC:                             380.9
Df Model:                          40
Covariance Type:            nonrobust

                        coef    std err          t      P>|t|      [0.025      0.975]
Intercept            32.0000      5.175      6.184      0.000      20.294      43.706
horsepower[T.103]    -4.0000      7.318     -0.547      0.598     -20.555      12.555
horsepower[T.112]    -1.0000      7.318     -0.137      0.894     -17.555      15.555
horsepower[T.116]    -9.0000      7.318     -1.230      0.250     -25.555       7.555
horsepower[T.117]     6.0000      7.318      0.820      0.433     -10.555      22.555
horsepower[T.118]     2.0000      7.318      0.273      0.791     -14.555      18.555
horsepower[T.120]    -4.0000      6.338     -0.631      0.544     -18.337      10.337
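The number of horsepower[T...] rows follows from how treatment coding works: one indicator column per distinct value, minus one reference level that is absorbed into the intercept. A rough sketch of that counting, using pandas' get_dummies as a stand-in for what the formula interface builds (the formula machinery itself is not called here, and the data is made up):

```python
import pandas as pd

# Made-up data: horsepower stored as strings, as in the problem case.
df_demo = pd.DataFrame({
    "mpg": [18, 15, 25, 30, 22],
    "horsepower": ["130", "165", "95", "95", "130"],
})

# One 0/1 column per distinct value, dropping the first as the reference level.
dummies = pd.get_dummies(df_demo["horsepower"], prefix="horsepower", drop_first=True)

n_levels = df_demo["horsepower"].nunique()
print(dummies.columns.tolist())
print(f"{n_levels} distinct values -> {dummies.shape[1]} dummy columns")
```

By the same logic, Df Model: 40 in the summary above suggests the 50 random draws contained 41 distinct horsepower values, which leaves only 9 residual degrees of freedom.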
Now, converting the strings back to integers:
df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')
results = smf.ols(formula, df).fit()
Output (as expected):
                            OLS Regression Results
Dep. Variable:                    mpg   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5388
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.466
Time:                        20:24:54   Log-Likelihood:                -147.65
No. Observations:                  50   AIC:                             299.3
Df Residuals:                      48   BIC:                             303.1
Df Model:                           1
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     31.7638      3.663      8.671      0.000      24.398      39.129
horsepower    -0.0176      0.024     -0.734      0.466      -0.066       0.031

Omnibus:                        3.529   Durbin-Watson:                   1.859
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                1.725
Skew:                           0.068   Prob(JB):                        0.422
Kurtosis:                       2.100   Cond. No.                         834.

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
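Because the formula in the question is built from every column between the first and the last, a quick guard before fitting can catch this kind of problem early. A minimal sketch, assuming made-up data; the helper name non_numeric_predictors is hypothetical and not part of pandas or statsmodels:

```python
import pandas as pd

def non_numeric_predictors(df, predictors):
    """Return the predictor columns whose dtype is not numeric."""
    return [col for col in predictors if not pd.api.types.is_numeric_dtype(df[col])]

# Made-up frame mimicking the problem: horsepower holds strings.
auto = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0],
    "cylinders": [8, 8, 4],
    "horsepower": ["130", "165", "95"],
    "name": ["a", "b", "c"],
})

predictors = list(auto.columns[1:-1])  # same slice as in the question
bad = non_numeric_predictors(auto, predictors)
print(bad)

# Convert the offending columns, then build the formula as before.
for col in bad:
    auto[col] = pd.to_numeric(auto[col], errors="coerce")
formula = "mpg ~ " + " + ".join(predictors)
print(formula)
```

If a column is meant to be categorical, leave it as is; the guard only flags columns for you to review.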