
Drawing the glm decision boundary with ggplot's stat_smooth() function returns the wrong line


I want to plot the decision boundary after fitting a logistic regression model to my data. I use ggplot and the stat_smooth() function to draw the decision boundary line. However, the resulting plot is wrong. A reproducible example follows:

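library(data.table)  # data.table(), setDT()
library(ggplot2)     # ggplot(), geom_point(), stat_smooth()
library(ggthemes)    # theme_economist()

#-----------------------------------------------------------------------------------------------------
# CONSTRUCT THE DATA
#-----------------------------------------------------------------------------------------------------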

X.1_Y.1 <- rnorm(1000, mean = 1.5, sd= 0.3)

X.2_Y.1 <- rnorm(1000, mean = 1.5, sd= 5)

X.1_Y.0 <- rnorm(99000, mean = 0, sd = 1)

X.2_Y.0 <- rnorm(99000, mean = 0, sd = 1)

data <- data.table(X.1 = c(X.1_Y.1 , X.1_Y.0),
                   X.2 = c(X.2_Y.1  , X.2_Y.0),
                   Y = c(rep(1, 1000) , rep(0, 99000 ))
                   )


#-----------------------------------------------------------------------------------------------------
# FIT A LOGISTIC MODEL ON THE DATA
#-----------------------------------------------------------------------------------------------------


model <- glm(Y ~ X.1 + X.2, data, family = "binomial")

summary(model)

#Call:
#  glm(formula = Y ~ ., family = "binomial", data = data)

#Deviance Residuals: 
#  Min       1Q   Median       3Q      Max  
#-1.6603  -0.1194  -0.0679  -0.0384   4.6263  

#Coefficients:
#  Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -6.04055    0.06636  -91.02   <2e-16 ***
#  X.1          1.60828    0.03854   41.73   <2e-16 ***
#  X.2          0.43272    0.01673   25.87   <2e-16 ***
#  ---
#  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#(Dispersion parameter for binomial family taken to be 1)

#Null deviance: 11200.3  on 99999  degrees of freedom
#Residual deviance:  8218.5  on 99997  degrees of freedom
#AIC: 8224.5


#-------------------------------------------------------------------------------------------------------
# DEFINE AND DRAW THE DECISION BOUNDARY
#-------------------------------------------------------------------------------------------------------

# 0 = -6.04 + 1.61 * X.1 + 0.44 * X.2 => X.2 = 6.04/0.44 - 1.61/0.44 * X.1

setDT(data)


ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) + 
   stat_smooth(formula = x.2 ~ 6.04/0.44 - (1.61/0.44) * X.1, color = "blue", size = 2) +
  coord_equal() +
  theme_economist()

This returns the following plot:

[Image: the plot produced by the code above]

You can easily see that the line drawn is wrong. According to the formula, X.2 should be 6.04/0.44 when X.1 = 0, which is clearly not the case in this plot.

Could you tell me where my code errs and how to correct it?

Your advice will be appreciated.


Solution

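  • stat_smooth() fits its own model to the plotted data, and its formula argument must be written in terms of the aesthetics y and x (for example y ~ x), so it cannot draw a line whose equation you already know. If you are trying to plot a line that you fit yourself, you should not be using stat_smooth; you should be using stat_function. For example: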

    ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
      geom_point(alpha = 0.2) + 
      stat_function(fun=function(x) {6.04/0.44 - (1.61/0.44) * x}, color = "blue", size = 2) +
      coord_equal()
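
The hard-coded numbers above only match one particular fit, so it may be more convenient to compute the line from the fitted coefficients. The boundary is where the predicted probability equals 0.5, i.e. where b0 + b1 * X.1 + b2 * X.2 = 0, which gives X.2 = -b0/b2 - (b1/b2) * X.1. The following is a minimal sketch under a few assumptions that are not in the question: it uses a smaller simulated data set in a plain data.frame, an arbitrary seed, and illustrative names (df, n1, n0, boundary_intercept, boundary_slope), and it draws the line with geom_abline() rather than stat_function().

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only so that this sketch is reproducible

# Smaller simulated data set with the same structure as in the question
# (the sizes n1 and n0 are illustrative, not the question's 1000 / 99000)
n1 <- 500
n0 <- 5000
df <- data.frame(
  X.1 = c(rnorm(n1, mean = 1.5, sd = 0.3), rnorm(n0, mean = 0, sd = 1)),
  X.2 = c(rnorm(n1, mean = 1.5, sd = 5),   rnorm(n0, mean = 0, sd = 1)),
  Y   = c(rep(1, n1), rep(0, n0))
)

model <- glm(Y ~ X.1 + X.2, data = df, family = binomial)
b <- coef(model)  # named vector: "(Intercept)", "X.1", "X.2"

# p = 0.5  <=>  b0 + b1 * X.1 + b2 * X.2 = 0
#          <=>  X.2 = -b0/b2 - (b1/b2) * X.1
boundary_intercept <- unname(-b["(Intercept)"] / b["X.2"])
boundary_slope     <- unname(-b["X.1"] / b["X.2"])

p <- ggplot(df, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = boundary_intercept, slope = boundary_slope,
              color = "blue") +
  labs(color = "Y")

print(p)
```

Because the slope and intercept are taken from coef(model), the line follows the model if the data or the fit change. Note that geom_abline() does not extend the axis limits, so the line is only visible where it crosses the range of the plotted points.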