Search code examples
rggplot2linear-regressionpredict

Predicted regression line not same as in geom_smooth(method= "lm", formula= "y ~ x")


I made a scatter plot with ggplot2 for my data with a regression line and I noticed that the regression line is not same as the predicted values from predict(). Why is this the case? See example below:

# Make some data
set.seed(1)
n <- 100
df <- data.frame(x= rnorm(n, 100, 15), group= sample(0:1, n/2, replace= TRUE))
df$y <- df$x*.1 + df$group*2 + rnorm(n)

# Fit data and predict values
mod <- lm(y ~ x + group, df)
df_new <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
                                          0:1)),
                   nm= c("x", "group"))
df_pred <- cbind.data.frame(df_new, pred= predict(mod, df_new))

# Plot data with predicted values and geom_smooth()
library(ggplot2)
ggplot(df, aes(x, y, col= factor(group))) +
  geom_point() +
  geom_smooth(method= "lm", formula= "y ~ x") +
  geom_line(data=df_pred, mapping= aes(x, pred, col= factor(group)))

The resulting plot is:

Plot

One can see that the predicted line does not match the one from geom_smooth. Why?

Sidenotes: There is a similiar question but in that case the person simply mixed up variables. As far as I can tell this is not the case in my code. Further, for the generated data both lines lay almost above each other but in my real data (which I can not share) the missmatch is even bigger.


Solution

  • As user20650 points already out in the comments, ggplot adds an interaction term, whereas your model mod does only account for group difference based on a additional variable + group. To get the same results either use an interaction term * group or create two different models for each group. See both ways below:

    # Make some data
    set.seed(1)
    n <- 100
    df <- data.frame(x= rnorm(n, 100, 15), group= sample(0:1, n/2, replace= TRUE))
    df$y <- df$x*.1 + df$group*2 + rnorm(n)
    
    # Fit data and predict values
    
    # With interaction
    mod <- lm(y ~ x * group, df) # <- add an interaction term
    df_new <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
                                              0:1)),
                       nm= c("x", "group"))
    df_pred <- cbind.data.frame(df_new, pred = predict(mod, df_new))
    
    
    # Predict for each group
    mod0 <- lm(y ~ x, df[df$group == 0, ]) 
    
    df_new0 <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
                                              0)),
                       nm= c("x", "group"))
    
    
    df_pred0 <- cbind.data.frame(df_new0, pred = predict(mod0, df_new0))
    
    mod1 <- lm(y ~ x, df[df$group == 1, ])
    
    df_new1 <- setNames(data.frame(expand.grid(round(seq(min(df$x), max(df$x), 1), 0),
                                               1)),
                        nm= c("x", "group"))
    
    df_pred1 <- cbind.data.frame(df_new1, pred = predict(mod1, df_new1))
    
    df_pred2 <- rbind(df_pred0, df_pred1)
    
    # Plot data with predicted values and geom_smooth()
    library(ggplot2)
    
    ggplot(df, aes(x, y, col = factor(group))) +
      geom_point() +
      geom_smooth(method= "lm", formula= "y ~ x") +
      geom_line(data=df_pred, mapping= aes(x, pred, col = factor(group))) + # with interaction
      geom_line(data=df_pred2, mapping= aes(x, pred, col = factor(group))) # with seperate groups