I'm building a regression model with a decently large amount of data (2146 observations). These are repeated measures so I will be using a mixed model, however, I always like to start with a more simple model to help see what the data looks like. The problem is that my regression coefficients aren't making sense to me and I can't figure out why they change so drastically when added to the model.
Here is an example of the first, simple regression model:
fit1 <- lm(Outcome.Variable ~ Group, data = dat)
summary(fit1)
Call:
lm(formula = Outcome.Variable ~ Group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-225.63 -75.96 -4.60 67.78 356.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 364.104 4.677 77.847 < 2e-16 ***
GroupB -65.187 7.268 -8.969 < 2e-16 ***
GroupC -31.776 6.982 -4.551 5.63e-06 ***
GroupD -37.268 6.337 -5.881 4.73e-09 ***
GroupE -11.172 7.661 -1.458 0.144902
GroupF -29.707 8.188 -3.628 0.000292 ***
GroupG -10.443 6.963 -1.500 0.133853
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 91.42 on 2139 degrees of freedom
Multiple R-squared: 0.0464, Adjusted R-squared: 0.04372
F-statistic: 17.35 on 6 and 2139 DF, p-value: < 2.2e-16
These coefficients make sense to me as the intercept is the mean of GroupA and estimates for each of the other Groups represent the difference from GroupA. A quick check of the data indicates this interpretation is correct:
library(dplyr)
dat %>%
group_by(Group) %>%
summarize(Outcome.Variable.Mean = mean(Outcome.Variable))
# A tibble: 7 × 2
Group Outcome.Variable.Mean
<chr> <dbl>
1 A 364.1045
2 B 298.9173
3 C 332.3286
4 D 326.8360
5 E 352.9324
6 F 334.3972
7 G 353.6617
I can build another simple linear regression with my second variable, Day:
fit2 <- lm(Outcome.Variable ~ Day, data = dat)
summary(fit2)
Call:
lm(formula = Outcome.Variable ~ Day, data = dat)
Residuals:
Min 1Q Median 3Q Max
-228.56 -43.45 -4.70 44.41 321.77
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 388.003 2.598 149.367 <2e-16 ***
Day2 -5.278 3.668 -1.439 0.15
Day3 -136.108 3.589 -37.921 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 68.28 on 2143 degrees of freedom
Multiple R-squared: 0.4669, Adjusted R-squared: 0.4664
F-statistic: 938.6 on 2 and 2143 DF, p-value: < 2.2e-16
Again, a quick check of the data indicates that these regression coefficient are interpreted correctly:
dat %>%
group_by(Day) %>%
summarize(Outcome.Variable.Mean = mean(Outcome.Variable))
# A tibble: 3 × 2
Day Outcome.Variable.Mean
<fctr> <dbl>
1 1 388.0027
2 2 382.7242
3 3 251.8942
Now, the issue comes when I combine both of them into the model together:
fit3 <- lm(Outcome.Variable ~ Day + Group, data = dat)
summary(fit3)
Call:
lm(formula = Outcome.Variable ~ Day + Group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-212.456 -43.442 -2.864 41.000 305.607
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 413.942 3.912 105.806 < 2e-16 ***
Day2 -5.801 3.504 -1.656 0.0979 .
Day3 -136.663 3.429 -39.859 < 2e-16 ***
GroupB -66.126 5.185 -12.753 < 2e-16 ***
GroupC -31.813 4.980 -6.388 2.06e-10 ***
GroupD -37.654 4.521 -8.329 < 2e-16 ***
GroupE -9.777 5.465 -1.789 0.0738 .
GroupF -24.570 5.842 -4.206 2.71e-05 ***
GroupG -10.067 4.967 -2.027 0.0428 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 65.21 on 2137 degrees of freedom
Multiple R-squared: 0.5152, Adjusted R-squared: 0.5134
F-statistic: 283.9 on 8 and 2137 DF, p-value: < 2.2e-16
These regression coefficients don't make sense to me. The intercept should be the mean value for GroupA on Day1, however, a check of the data reveals that this is not the case at all:
as.data.frame(dat %>%
group_by(Day, Group) %>%
summarize(Outcome.Variable.Mean = mean(Outcome.Variable)))
Day Group Outcome.Variable.Mean
1 1 A 420.5681
2 1 B 331.6633
3 1 C 380.9213
4 1 D 382.2743
5 1 E 405.1115
6 1 F 392.5020
7 1 G 400.5005
8 2 A 405.3756
9 2 B 339.2346
10 2 C 389.3252
11 2 D 374.0798
12 2 E 388.7488
13 2 F 377.9685
14 2 G 395.5381
15 3 A 273.7767
16 3 B 229.6742
17 3 C 234.4119
18 3 D 230.6635
19 3 E 275.2313
20 3 F 254.7107
21 3 G 272.6063
What is happening here? I don't want to proceed to a mixed model without first understanding what is taking place in this more basic model. Why is the intercept not representative of the mean value for GroupA on Day1? Even the differences between the intercept and the other estimates don't appear correct. For example, The difference between the intercept and Day 2 is -5.8. However, the difference between GroupA on Day1 and GroupA on Day2 is 15 points.
Any help understanding what is taking place here would be greatly appreciated.
You're ignoring the interaction between the terms. Let me demonstrate using the mtcars
data:
First, I run a regression disp ~ factor(cyl)
(I have to call factor
because by default all variables in mtcars
are numeric):
library(dplyr)
lm(disp ~ factor(cyl), mtcars)
#>
#> Call:
#> lm(formula = disp ~ factor(cyl), data = mtcars)
#>
#> Coefficients:
#> (Intercept) factor(cyl)6 factor(cyl)8
#> 105.14 78.18 247.96
mtcars %>% group_by(cyl) %>% summarize(mean = mean(disp))
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 105.1364
#> 2 6 183.3143
#> 3 8 353.1000
As you can see, the regression sets the intercept to the mean disp of group cyl = 4.
Next, I run a regression disp ~ factor(gear)
:
lm(disp ~ factor(gear), mtcars)
#>
#> Call:
#> lm(formula = disp ~ factor(gear), data = mtcars)
#>
#> Coefficients:
#> (Intercept) factor(gear)4 factor(gear)5
#> 326.3 -203.3 -123.8
mtcars %>% group_by(gear) %>% summarize(mean = mean(disp))
#> # A tibble: 3 x 2
#> gear mean
#> <dbl> <dbl>
#> 1 3 326.3000
#> 2 4 123.0167
#> 3 5 202.4800
Once again, the regression's outputs are the group means.
Now to combine them my regression formula is disp ~ factor(cyl) * factor(gear)
which is equivalent to disp ~ factor(cyl) + factor(gear) + factor(cyl):factor(gear)
:
lm(disp ~ factor(cyl)*factor(gear), mtcars)
#>
#> Call:
#> lm(formula = disp ~ factor(cyl) * factor(gear), data = mtcars)
#>
#> Coefficients:
#> (Intercept) factor(cyl)6
#> 120.10 121.40
#> factor(cyl)8 factor(gear)4
#> 237.52 -17.47
#> factor(gear)5 factor(cyl)6:factor(gear)4
#> -12.40 -60.23
#> factor(cyl)8:factor(gear)4 factor(cyl)6:factor(gear)5
#> NA -84.10
#> factor(cyl)8:factor(gear)5
#> -19.22
mtcars %>% group_by(cyl, gear) %>% summarize(mean(disp))
#> # A tibble: 8 x 3
#> # Groups: cyl [?]
#> cyl gear `mean(disp)`
#> <dbl> <dbl> <dbl>
#> 1 4 3 120.1000
#> 2 4 4 102.6250
#> 3 4 5 107.7000
#> 4 6 3 241.5000
#> 5 6 4 163.8000
#> 6 6 5 145.0000
#> 7 8 3 357.6167
#> 8 8 5 326.0000