Search code examples
rregressionmodeling

Why aren't my regression coefficients making sense in my R model?


I'm building a regression model with a decently large amount of data (2146 observations). These are repeated measures so I will be using a mixed model, however, I always like to start with a more simple model to help see what the data looks like. The problem is that my regression coefficients aren't making sense to me and I can't figure out why they change so drastically when added to the model.

Here is an example of the first, simple regression model:

fit1 <- lm(Outcome.Variable ~ Group, data = dat)
summary(fit1)

Call:
lm(formula = Outcome.Variable ~ Group, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-225.63  -75.96   -4.60   67.78  356.84 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  364.104      4.677  77.847  < 2e-16 ***
GroupB       -65.187      7.268  -8.969  < 2e-16 ***
GroupC       -31.776      6.982  -4.551 5.63e-06 ***
GroupD       -37.268      6.337  -5.881 4.73e-09 ***
GroupE       -11.172      7.661  -1.458 0.144902    
GroupF       -29.707      8.188  -3.628 0.000292 ***
GroupG       -10.443      6.963  -1.500 0.133853    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 91.42 on 2139 degrees of freedom
Multiple R-squared:  0.0464,    Adjusted R-squared:  0.04372 
F-statistic: 17.35 on 6 and 2139 DF,  p-value: < 2.2e-16

These coefficients make sense to me as the intercept is the mean of GroupA and estimates for each of the other Groups represent the difference from GroupA. A quick check of the data indicates this interpretation is correct:

library(dplyr)

dat %>%
    group_by(Group) %>%
    summarize(Outcome.Variable.Mean = mean(Outcome.Variable))
# A tibble: 7 × 2
  Group Outcome.Variable.Mean
  <chr>                 <dbl>
1     A              364.1045
2     B              298.9173
3     C              332.3286
4     D              326.8360
5     E              352.9324
6     F              334.3972
7     G              353.6617

I can build another simple linear regression with my second variable, Day:

fit2 <- lm(Outcome.Variable ~ Day, data = dat)
summary(fit2)

Call:
lm(formula = Outcome.Variable ~ Day, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-228.56  -43.45   -4.70   44.41  321.77 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  388.003      2.598 149.367   <2e-16 ***
Day2          -5.278      3.668  -1.439     0.15    
Day3        -136.108      3.589 -37.921   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 68.28 on 2143 degrees of freedom
Multiple R-squared:  0.4669,    Adjusted R-squared:  0.4664 
F-statistic: 938.6 on 2 and 2143 DF,  p-value: < 2.2e-16

Again, a quick check of the data indicates that these regression coefficient are interpreted correctly:

dat %>%
    group_by(Day) %>%
    summarize(Outcome.Variable.Mean = mean(Outcome.Variable))

# A tibble: 3 × 2
     Day Outcome.Variable.Mean
  <fctr>                 <dbl>
1      1              388.0027
2      2              382.7242
3      3              251.8942

Now, the issue comes when I combine both of them into the model together:

fit3 <- lm(Outcome.Variable ~ Day + Group, data = dat)
summary(fit3)

Call:
lm(formula = Outcome.Variable ~ Day + Group, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-212.456  -43.442   -2.864   41.000  305.607 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  413.942      3.912 105.806  < 2e-16 ***
Day2          -5.801      3.504  -1.656   0.0979 .  
Day3        -136.663      3.429 -39.859  < 2e-16 ***
GroupB       -66.126      5.185 -12.753  < 2e-16 ***
GroupC       -31.813      4.980  -6.388 2.06e-10 ***
GroupD       -37.654      4.521  -8.329  < 2e-16 ***
GroupE        -9.777      5.465  -1.789   0.0738 .  
GroupF       -24.570      5.842  -4.206 2.71e-05 ***
GroupG       -10.067      4.967  -2.027   0.0428 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 65.21 on 2137 degrees of freedom
Multiple R-squared:  0.5152,    Adjusted R-squared:  0.5134 
F-statistic: 283.9 on 8 and 2137 DF,  p-value: < 2.2e-16

These regression coefficients don't make sense to me. The intercept should be the mean value for GroupA on Day1, however, a check of the data reveals that this is not the case at all:

as.data.frame(dat %>%
    group_by(Day, Group) %>%
    summarize(Outcome.Variable.Mean = mean(Outcome.Variable)))

   Day Group Outcome.Variable.Mean
1    1     A              420.5681
2    1     B              331.6633
3    1     C              380.9213
4    1     D              382.2743
5    1     E              405.1115
6    1     F              392.5020
7    1     G              400.5005
8    2     A              405.3756
9    2     B              339.2346
10   2     C              389.3252
11   2     D              374.0798
12   2     E              388.7488
13   2     F              377.9685
14   2     G              395.5381
15   3     A              273.7767
16   3     B              229.6742
17   3     C              234.4119
18   3     D              230.6635
19   3     E              275.2313
20   3     F              254.7107
21   3     G              272.6063

What is happening here? I don't want to proceed to a mixed model without first understanding what is taking place in this more basic model. Why is the intercept not representative of the mean value for GroupA on Day1? Even the differences between the intercept and the other estimates don't appear correct. For example, The difference between the intercept and Day 2 is -5.8. However, the difference between GroupA on Day1 and GroupA on Day2 is 15 points.

Any help understanding what is taking place here would be greatly appreciated.


Solution

  • You're ignoring the interaction between the terms. Let me demonstrate using the mtcars data:

    First, I run a regression disp ~ factor(cyl) (I have to call factor because by default all variables in mtcars are numeric):

    library(dplyr)
    
    lm(disp ~ factor(cyl), mtcars)
    #> 
    #> Call:
    #> lm(formula = disp ~ factor(cyl), data = mtcars)
    #> 
    #> Coefficients:
    #>  (Intercept)  factor(cyl)6  factor(cyl)8  
    #>       105.14         78.18        247.96
    
    mtcars %>% group_by(cyl) %>% summarize(mean = mean(disp))
    #> # A tibble: 3 x 2
    #>     cyl     mean
    #>   <dbl>    <dbl>
    #> 1     4 105.1364
    #> 2     6 183.3143
    #> 3     8 353.1000
    

    As you can see, the regression sets the intercept to the mean disp of group cyl = 4.

    Next, I run a regression disp ~ factor(gear):

    lm(disp ~ factor(gear), mtcars)
    #> 
    #> Call:
    #> lm(formula = disp ~ factor(gear), data = mtcars)
    #> 
    #> Coefficients:
    #>   (Intercept)  factor(gear)4  factor(gear)5  
    #>         326.3         -203.3         -123.8
    
    mtcars %>% group_by(gear) %>% summarize(mean = mean(disp))
    #> # A tibble: 3 x 2
    #>    gear     mean
    #>   <dbl>    <dbl>
    #> 1     3 326.3000
    #> 2     4 123.0167
    #> 3     5 202.4800
    

    Once again, the regression's outputs are the group means.

    Now to combine them my regression formula is disp ~ factor(cyl) * factor(gear) which is equivalent to disp ~ factor(cyl) + factor(gear) + factor(cyl):factor(gear):

    lm(disp ~ factor(cyl)*factor(gear), mtcars)
    #> 
    #> Call:
    #> lm(formula = disp ~ factor(cyl) * factor(gear), data = mtcars)
    #> 
    #> Coefficients:
    #>                (Intercept)                factor(cyl)6  
    #>                     120.10                      121.40  
    #>               factor(cyl)8               factor(gear)4  
    #>                     237.52                      -17.47  
    #>              factor(gear)5  factor(cyl)6:factor(gear)4  
    #>                     -12.40                      -60.23  
    #> factor(cyl)8:factor(gear)4  factor(cyl)6:factor(gear)5  
    #>                         NA                      -84.10  
    #> factor(cyl)8:factor(gear)5  
    #>                     -19.22
    
    
    mtcars %>% group_by(cyl, gear) %>% summarize(mean(disp))
    #> # A tibble: 8 x 3
    #> # Groups:   cyl [?]
    #>     cyl  gear `mean(disp)`
    #>   <dbl> <dbl>        <dbl>
    #> 1     4     3     120.1000
    #> 2     4     4     102.6250
    #> 3     4     5     107.7000
    #> 4     6     3     241.5000
    #> 5     6     4     163.8000
    #> 6     6     5     145.0000
    #> 7     8     3     357.6167
    #> 8     8     5     326.0000