I made a scatter plot with ggplot2
for my data with a regression line and I noticed that the regression line is not same as the predicted values from predict()
. Why is this the case? See example below:
# Make some data
set.seed(1)
n <- 100
df <- data.frame(x= rnorm(n, 100, 15), group= sample(0:1, n/2, replace= TRUE))
df$y <- df$x*.1 + df$group*2 + rnorm(n)
# Fit data and predict values
mod <- lm(y ~ x + group, df)
df_new <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
0:1)),
nm= c("x", "group"))
df_pred <- cbind.data.frame(df_new, pred= predict(mod, df_new))
# Plot data with predicted values and geom_smooth()
library(ggplot2)
ggplot(df, aes(x, y, col= factor(group))) +
geom_point() +
geom_smooth(method= "lm", formula= "y ~ x") +
geom_line(data=df_pred, mapping= aes(x, pred, col= factor(group)))
The resulting plot is:
One can see that the predicted line does not match the one from geom_smooth
. Why?
Sidenotes: There is a similiar question but in that case the person simply mixed up variables. As far as I can tell this is not the case in my code. Further, for the generated data both lines lay almost above each other but in my real data (which I can not share) the missmatch is even bigger.
As user20650 points already out in the comments, ggplot
adds an interaction term, whereas your model mod
does only account for group difference based on a additional variable + group
. To get the same results either use an interaction term * group
or create two different models for each group. See both ways below:
# Make some data
set.seed(1)
n <- 100
df <- data.frame(x= rnorm(n, 100, 15), group= sample(0:1, n/2, replace= TRUE))
df$y <- df$x*.1 + df$group*2 + rnorm(n)
# Fit data and predict values
# With interaction
mod <- lm(y ~ x * group, df) # <- add an interaction term
df_new <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
0:1)),
nm= c("x", "group"))
df_pred <- cbind.data.frame(df_new, pred = predict(mod, df_new))
# Predict for each group
mod0 <- lm(y ~ x, df[df$group == 0, ])
df_new0 <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
0)),
nm= c("x", "group"))
df_pred0 <- cbind.data.frame(df_new0, pred = predict(mod0, df_new0))
mod1 <- lm(y ~ x, df[df$group == 1, ])
df_new1 <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
1)),
nm= c("x", "group"))
df_pred1 <- cbind.data.frame(df_new1, pred = predict(mod1, df_new1))
df_pred2 <- rbind(df_pred0, df_pred1)
# Plot data with predicted values and geom_smooth()
library(ggplot2)
ggplot(df, aes(x, y, col = factor(group))) +
geom_point() +
geom_smooth(method= "lm", formula= "y ~ x") +
geom_line(data=df_pred, mapping= aes(x, pred, col = factor(group))) + # with interaction
geom_line(data=df_pred2, mapping= aes(x, pred, col = factor(group))) # with seperate groups