I want to plot the decision boundary after fitting a logistic regression model to my data. I use ggplot2 and the stat_smooth() function to draw the decision boundary line. However, the resulting plot is wrong. A reproducible example follows:
#-----------------------------------------------------------------------------------------------------
# CONSTRUCT THE DATA
#-----------------------------------------------------------------------------------------------------
library(data.table)
library(ggplot2)
library(ggthemes)  # provides theme_economist()
X.1_Y.1 <- rnorm(1000, mean = 1.5, sd = 0.3)
X.2_Y.1 <- rnorm(1000, mean = 1.5, sd = 5)
X.1_Y.0 <- rnorm(99000, mean = 0, sd = 1)
X.2_Y.0 <- rnorm(99000, mean = 0, sd = 1)
data <- data.table(X.1 = c(X.1_Y.1, X.1_Y.0),
                   X.2 = c(X.2_Y.1, X.2_Y.0),
                   Y   = c(rep(1, 1000), rep(0, 99000)))
#-----------------------------------------------------------------------------------------------------
# FIT A LOGISTIC MODEL ON THE DATA
#-----------------------------------------------------------------------------------------------------
model <- glm(Y ~ X.1 + X.2, data, family = "binomial")
summary(model)
#Call:
# glm(formula = Y ~ ., family = "binomial", data = data)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.6603 -0.1194 -0.0679 -0.0384 4.6263
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.04055 0.06636 -91.02 <2e-16 ***
# X.1 1.60828 0.03854 41.73 <2e-16 ***
# X.2 0.43272 0.01673 25.87 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for binomial family taken to be 1)
#Null deviance: 11200.3 on 99999 degrees of freedom
#Residual deviance: 8218.5 on 99997 degrees of freedom
#AIC: 8224.5
#-------------------------------------------------------------------------------------------------------
# DEFINE AND DRAW THE DECISION BOUNDARY
#-------------------------------------------------------------------------------------------------------
# Boundary (p = 0.5): 0 = -6.04 + 1.61 * X.1 + 0.44 * X.2  =>  X.2 = 6.04/0.44 - (1.61/0.44) * X.1
setDT(data)
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  stat_smooth(formula = x.2 ~ 6.04/0.44 - (1.61/0.44) * X.1, color = "blue", size = 2) +
  coord_equal() +
  theme_economist()
This returns the following plot:
You can easily see that the line drawn is wrong. According to the formula, X.2 should be 6.04/0.44 when X.1 = 0, which is clearly not the case in this plot.
Could you tell me where my code goes wrong and how to correct it?
Your advice will be appreciated.
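If you are trying to plot a line on your graph that you fit yourself, you should not be using stat_smooth; you should be using stat_function. stat_smooth fits its own smoother to the plotted data, and its formula must be written in terms of the aesthetics x and y, so it cannot draw a fixed line from hard-coded coefficients. For example: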
ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
  geom_point(alpha = 0.2) +
  stat_function(fun = function(x) {6.04/0.44 - (1.61/0.44) * x}, color = "blue", size = 2) +
  coord_equal()
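As a possible variation, the intercept and slope of the boundary can be read from the fitted model instead of being typed in as rounded numbers. The sketch below is only an illustration under a few assumptions that are not in the original post: it uses a smaller simulated sample with the same structure, a seed added for reproducibility, a plain data.frame, and geom_abline() in place of stat_function(). The names boundary_intercept and boundary_slope are made up for this example.

```r
# Sketch: derive the p = 0.5 decision boundary from the fitted coefficients.
set.seed(42)  # assumption: seed added here only so the sketch is reproducible

# Assumption: a smaller sample than in the question, same structure
n1 <- 100
n0 <- 9900
data <- data.frame(
  X.1 = c(rnorm(n1, mean = 1.5, sd = 0.3), rnorm(n0, mean = 0, sd = 1)),
  X.2 = c(rnorm(n1, mean = 1.5, sd = 5),   rnorm(n0, mean = 0, sd = 1)),
  Y   = c(rep(1, n1), rep(0, n0))
)

model <- glm(Y ~ X.1 + X.2, data = data, family = "binomial")
b <- coef(model)  # named vector: "(Intercept)", "X.1", "X.2"

# Boundary: b0 + b1 * X.1 + b2 * X.2 = 0  =>  X.2 = -b0/b2 - (b1/b2) * X.1
boundary_intercept <- unname(-b["(Intercept)"] / b["X.2"])
boundary_slope     <- unname(-b["X.1"] / b["X.2"])

# Plot only if ggplot2 is installed; geom_abline draws a fixed straight line
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data, aes(X.1, X.2, color = as.factor(Y))) +
    geom_point(alpha = 0.2) +
    geom_abline(intercept = boundary_intercept, slope = boundary_slope,
                color = "blue") +
    labs(color = "Y")
  print(p)
}
```

Any point on this line has a predicted probability of 0.5 under the fitted model, which can be checked with predict(model, newdata = ..., type = "response"). Computing the line from coef(model) also avoids the rounding that hand-copied coefficients introduce.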