Tags: r, linear-regression, dummy-variable

Adding a dummy variable changes coefficients


Should adding a dummy variable change coefficients for other explanatory variables in a linear model? I thought it would only change the intercept but the coefficients have changed for non-intercept terms as well.

Here is the example code with the mtcars data (sourced from http://rstudio-pubs-static.s3.amazonaws.com/20516_29b941670a4b42688292b4bb892a660f.html):

data(mtcars)
mtcars$am_text <- as.factor(mtcars$am)
levels(mtcars$am_text) <- c("Automatic", "Manual")


fit1 <- lm(mpg ~ am_text + wt, data = mtcars)
summary(fit1)

Call:
lm(formula = mpg ~ am_text + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5295 -2.3619 -0.1317  1.4025  6.8782 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.32155    3.05464  12.218 5.84e-13 ***
am_textManual -0.02362    1.54565  -0.015    0.988    
wt            -5.35281    0.78824  -6.791 1.87e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.098 on 29 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7358 
F-statistic: 44.17 on 2 and 29 DF,  p-value: 1.579e-09

Now fitting a linear model on a subset of the data (automatic cars only):

# No dummy variable here; the model is fit only to the automatic-car subset
fit2 <- lm(mpg ~ wt, data = mtcars[mtcars$am_text == "Automatic",])
summary(fit2)

Call:
lm(formula = mpg ~ wt, data = mtcars[mtcars$am_text == "Automatic",])

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5227 -0.2168  1.4816  5.0610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  31.4161     2.9467  10.661 6.01e-09 ***
wt           -3.7859     0.7666  -4.939 0.000125 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.528 on 17 degrees of freedom
Multiple R-squared:  0.5893,    Adjusted R-squared:  0.5651 
F-statistic: 24.39 on 1 and 17 DF,  p-value: 0.0001246

Solution

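  • The wt coefficient in fit1 is a single slope shared by automatic and manual cars; the dummy variable only gives each group its own intercept. That shared slope is estimated from all 32 cars, whereas fit2 uses only the 19 automatic cars, so the wt coefficients (and the intercepts) differ. If you also include an interaction term between am_text and wt (am_text:wt), each group gets its own slope, and the result can be compared directly with the automatic-only model (fit2).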

    # am_text * wt is shorthand for am_text + wt + am_text:wt
    fit3 <- lm(mpg ~ am_text * wt, data = mtcars)
    summary(fit3)
    
    # Call:
    # lm(formula = mpg ~ am_text * wt, data = mtcars)
    # 
    # Residuals:
    #     Min      1Q  Median      3Q     Max 
    # -3.6004 -1.5446 -0.5325  0.9012  6.0909 
    # 
    # Coefficients:
    #                  Estimate Std. Error t value Pr(>|t|)    
    # (Intercept)       31.4161     3.0201  10.402 4.00e-11 ***
    # am_textManual     14.8784     4.2640   3.489  0.00162 ** 
    # wt                -3.7859     0.7856  -4.819 4.55e-05 ***
    # am_textManual:wt  -5.2984     1.4447  -3.667  0.00102 ** 
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    # 
    # Residual standard error: 2.591 on 28 degrees of freedom
    # Multiple R-squared:  0.833,   Adjusted R-squared:  0.8151 
    # F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11
    

    Notice that the (Intercept) and wt coefficients of fit3 are now the intercept and slope for automatic cars alone (the reference level), and they match the coefficients of fit2:

    coef(fit2) # fit only to automatic
    # (Intercept)          wt 
    #   31.416055   -3.785908 
    
    coef(fit3)
    # (Intercept)    am_textManual               wt am_textManual:wt 
    #   31.416055        14.878423        -3.785908        -5.298360
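
The same reasoning should apply to manual cars: under R's default treatment contrasts, their intercept and slope are the automatic (reference) values plus the am_textManual and am_textManual:wt terms, which from the output above comes to roughly 46.29 and -9.08. The sketch below checks this against a separate fit on the manual subset. The names fit_manual and manual_from_fit3 are introduced here for illustration and are not part of the original answer.

```r
data(mtcars)
mtcars$am_text <- factor(mtcars$am, levels = c(0, 1),
                         labels = c("Automatic", "Manual"))

# Interaction model: separate intercept and slope for each transmission type
fit3 <- lm(mpg ~ am_text * wt, data = mtcars)

# Hypothetical comparison model, fit only to manual cars
fit_manual <- lm(mpg ~ wt, data = mtcars[mtcars$am_text == "Manual", ])

# Manual intercept and slope = reference (automatic) value + manual offset
b <- coef(fit3)
manual_from_fit3 <- c(
  "(Intercept)" = unname(b["(Intercept)"] + b["am_textManual"]),
  wt            = unname(b["wt"] + b["am_textManual:wt"])
)

manual_from_fit3
coef(fit_manual)

# TRUE if the two sets of coefficients agree up to numerical tolerance
all.equal(manual_from_fit3, coef(fit_manual))
```

Only the point estimates are expected to agree. The standard errors can differ, since fit3 estimates one residual variance from all cars while each subset model uses only its own group.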