python, pandas, linear-regression, statsmodels, dummy-variable

Linear regression with dummy/categorical variables


I have a set of data, and I have used pandas to convert its columns into dummy and categorical variables, respectively. Now I want to know how to run a multiple linear regression in Python (I am using statsmodels). Are there any special considerations, or do I have to indicate somehow in my code that the variables are dummy/categorical? Or is the transformation of the variables enough, so that I can just run the regression as model = sm.OLS(y, X).fit()?

My code is the following:

import pandas as pd

datos = pd.read_csv("datos_2.csv")
df = pd.DataFrame(datos)  # read_csv already returns a DataFrame, so this copy is optional
print(df)

I get this:

Age  Gender    Wage         Job         Classification 
32    Male  450000       Professor           High
28    Male  500000  Administrative           High
40  Female   20000       Professor            Low
47    Male   70000       Assistant         Medium
50  Female  345000       Professor         Medium
27  Female  156000       Assistant            Low
56    Male  432000  Administrative            Low
43  Female  100000  Administrative            Low

Then I encode Gender as 1 = Male, 0 = Female, and Job as 1: Professor, 2: Administrative, 3: Assistant, like this:

df['Sex_male'] = df.Gender.map({'Female': 0, 'Male': 1})
df['Job_index'] = df.Job.map({'Professor': 1, 'Administrative': 2, 'Assistant': 3})
print(df)

Getting this:

 Age  Gender    Wage             Job Classification  Sex_male  Job_index
 32    Male  450000       Professor           High         1          1
 28    Male  500000  Administrative           High         1          2
 40  Female   20000       Professor            Low         0          1
 47    Male   70000       Assistant         Medium         1          3
 50  Female  345000       Professor         Medium         0          1
 27  Female  156000       Assistant            Low         0          3
 56    Male  432000  Administrative            Low         1          2
 43  Female  100000  Administrative            Low         0          2

Now, if I run a multiple linear regression, for example:

import statsmodels.api as sm

y = df['Wage']
X = df[['Sex_male', 'Job_index', 'Age']]
X = sm.add_constant(X)
model1 = sm.OLS(y, X).fit()
results1 = model1.summary(alpha=0.05)
print(results1)

The result prints normally, but is it correct? Or do I have to indicate somehow that the variables are dummy or categorical? Please help, I am new to Python and I want to learn. Greetings from South America - Chile.


Solution

  • You'll need to indicate that either Job or Job_index is a categorical variable; otherwise Job_index will be treated as a continuous variable that just happens to take the values 1, 2, and 3, which isn't right.

    You can use a few different kinds of notation in statsmodels; here's the formula approach, which uses C() to mark a categorical variable:

    from statsmodels.formula.api import ols
    
    fit = ols('Wage ~ C(Sex_male) + C(Job) + Age', data=df).fit() 
    
    fit.summary()
    
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                   Wage   R-squared:                       0.592
    Model:                            OLS   Adj. R-squared:                  0.048
    Method:                 Least Squares   F-statistic:                     1.089
    Date:                Wed, 06 Jun 2018   Prob (F-statistic):              0.492
    Time:                        22:35:43   Log-Likelihood:                -104.59
    No. Observations:                   8   AIC:                             219.2
    Df Residuals:                       3   BIC:                             219.6
    Df Model:                           4                                         
    Covariance Type:            nonrobust                                         
    =======================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
    ---------------------------------------------------------------------------------------
    Intercept             3.67e+05   3.22e+05      1.141      0.337   -6.57e+05    1.39e+06
    C(Sex_male)[T.1]     2.083e+05   1.39e+05      1.498      0.231   -2.34e+05    6.51e+05
    C(Job)[T.Assistant] -2.167e+05   1.77e+05     -1.223      0.309    -7.8e+05    3.47e+05
    C(Job)[T.Professor] -9273.0556   1.61e+05     -0.058      0.958   -5.21e+05    5.03e+05
    Age                 -3823.7419   6850.345     -0.558      0.616   -2.56e+04     1.8e+04
    ==============================================================================
    Omnibus:                        0.479   Durbin-Watson:                   1.620
    Prob(Omnibus):                  0.787   Jarque-Bera (JB):                0.464
    Skew:                          -0.108   Prob(JB):                        0.793
    Kurtosis:                       1.839   Cond. No.                         215.
    ==============================================================================
    

    Note: Job and Job_index won't use the same categorical level as a baseline, so you'll see slightly different results for the dummy coefficients at each level, even though the overall model fit remains the same.
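    If you want to control which level serves as the baseline yourself, patsy (the formula engine behind statsmodels) provides Treatment() for setting the reference category explicitly. A minimal sketch, using an inline copy of the data above:

    ```python
    import pandas as pd
    from statsmodels.formula.api import ols

    df = pd.DataFrame({
        'Wage': [450000, 500000, 20000, 70000, 345000, 156000, 432000, 100000],
        'Age': [32, 28, 40, 47, 50, 27, 56, 43],
        'Sex_male': [1, 1, 0, 1, 0, 0, 1, 0],
        'Job': ['Professor', 'Administrative', 'Professor', 'Assistant',
                'Professor', 'Assistant', 'Administrative', 'Administrative'],
    })

    # Treatment('Professor') makes Professor the reference level, so the two
    # Job dummies measure Administrative-vs-Professor and Assistant-vs-Professor.
    fit = ols("Wage ~ C(Sex_male) + C(Job, Treatment('Professor')) + Age",
              data=df).fit()
    print(fit.params)
    ```

    The overall fit statistics are identical to the default coding; only the interpretation of the individual dummy coefficients changes.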
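    Alternatively, outside the formula interface you can build the dummy columns yourself with pandas' get_dummies (drop_first=True drops one level to act as the baseline) and pass the resulting design matrix to sm.OLS directly. A sketch with the same data:

    ```python
    import pandas as pd
    import statsmodels.api as sm

    df = pd.DataFrame({
        'Wage': [450000, 500000, 20000, 70000, 345000, 156000, 432000, 100000],
        'Age': [32, 28, 40, 47, 50, 27, 56, 43],
        'Sex_male': [1, 1, 0, 1, 0, 0, 1, 0],
        'Job': ['Professor', 'Administrative', 'Professor', 'Assistant',
                'Professor', 'Assistant', 'Administrative', 'Administrative'],
    })

    # One dummy column per Job level, minus the first (the baseline);
    # dtype=float keeps the design matrix numeric for OLS.
    X = pd.get_dummies(df[['Age', 'Sex_male', 'Job']],
                       columns=['Job'], drop_first=True, dtype=float)
    X = sm.add_constant(X)
    y = df['Wage']

    model = sm.OLS(y, X).fit()
    print(model.params)
    ```

    This is equivalent to what C(Job) does under the hood, just with the encoding made explicit in your own DataFrame.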