I am trying to model a logistic regression with a couple of variables. I see that one of my variables has a quadratic trend, by plotting response by that variable and fitting a loess curve on it. So, I want to add a quadratic term to my logistic regression model, to model this variable with a quadratic trend. I'm having some trouble figuring out how to do this in the best / most accurate way.
Ex below:
Create df:
set.seed(1)
df <- data.frame(response = c(rep(0,times=30),rep(1,times=20)),
var1 = runif(50,min=12,max=30),
var2 = c(runif(20,min=0,max=25),runif(10,min=30,max=50),runif(20,min=15,max=40)),
var3 = var2^2) # note that this is just var2 squared
Plot by the second variable to view quadratic trend
ggplot(df,aes(x=var2,y=response)) +
geom_point() +
geom_smooth(method="loess")+
coord_cartesian(ylim = c(0,1))
test a few different model formulas
formulas <- list(response ~ var1 + var2, # both vars linear
response ~ var1 + var2 + I(var2^2), # add quad term for var2
response ~ var1 + I(var2^2), # only quad term for var2
response ~ var1 + var2 + var3, # add var3, which is var2^2
response ~ var1 + var3) # only var1 and var3
# build a df of some model selection criteria:
selection <- purrr::map_df(formulas, ~{
mod <- glm(.x, data= df, family="binomial")
data.frame(formula = format(.x),
AIC = round(AIC(mod),2),
BIC = round(BIC(mod),2),
R2adj = round(DescTools::PseudoR2(mod,which=c("McFaddenAdj")),4)
)
}) %>% arrange(desc(AIC))
view selection criteria:
> selection
formula AIC BIC R2adj
1 response ~ var1 + I(var2^2) 65.88 71.62 0.0211
2 response ~ var1 + var2 65.26 70.99 0.0304
3 response ~ var1 + var2 + var3 64.69 72.33 0.0389
4 response ~ var1 + var3 63.18 68.91 0.0613
5 response ~ var1 + var2 + I(var2^2) 45.09 52.74 0.3300
Basically I'm wondering- can someone explain to me why these are all different? What should I be using to use one term with a quadratic pattern? Why am I getting such different results?
I get different results to you:
> selection
formula AIC BIC R2adj
1 response ~ var1 + var2 + I(var2^2) 40.4 48.05 0.3997
2 response ~ var1 + var2 + var3 40.4 48.05 0.3997
3 response ~ var1 + var2 70.5 76.23 -0.0475
4 response ~ var1 + I(var2^2) 72.6 78.34 -0.0788
5 response ~ var1 + var3 72.6 78.34 -0.0788
Which makes sense to me. So I don't know what you did. Maybe you changed the data?
Edit: I'm thinking that you've got a floating var3 vector outside the df which is not the same as the one you think. I mean, it's not var2^2. Creating a data frame in base R is not the same as using third party packages such as dplyr, which allows you to create new variables from other variables "promised" to be created in the data frame. You should probably use the tibble function:
set.seed(1)
df <- tibble(response = c(rep(0,times=30), rep(1,times=20)),
var1 = runif(50,min=12,max=30),
var2 = c(runif(20,min=0,max=25), runif(10,min=30,max=50), runif(20,min=15,max=40)),
var3 = var2^2)