Search code examples
rstatisticslogistic-regressionmodelingglm

adding quadratic term to logistic regression


I am trying to model a logistic regression with a couple of variables. I see that one of my variables has a quadratic trend, by plotting response by that variable and fitting a loess curve on it. So, I want to add a quadratic term to my logistic regression model, to model this variable with a quadratic trend. I'm having some trouble figuring out how to do this in the best / most accurate way.

Ex below:

Create df:

set.seed(1)
df <- data.frame(response = c(rep(0,times=30),rep(1,times=20)),
                 var1 = runif(50,min=12,max=30),
                 var2 = c(runif(20,min=0,max=25),runif(10,min=30,max=50),runif(20,min=15,max=40)),
                 var3 = var2^2) # note that this is just var2 squared

Plot by the second variable to view quadratic trend

ggplot(df,aes(x=var2,y=response)) +
  geom_point() +
  geom_smooth(method="loess")+
  coord_cartesian(ylim = c(0,1))

test a few different model formulas

formulas <- list(response ~ var1 + var2,              # both vars linear
                 response ~ var1 + var2 + I(var2^2),  # add quad term for var2
                 response ~ var1 + I(var2^2),         # only quad term for var2
                 response ~ var1 + var2 + var3,       # add var3, which is var2^2
                 response ~ var1 + var3)              # only var1 and var3

# build a df of some model selection criteria:
selection <-  purrr::map_df(formulas, ~{   
  mod <- glm(.x, data= df, family="binomial")
  data.frame(formula = format(.x), 
             AIC = round(AIC(mod),2), 
             BIC = round(BIC(mod),2),
             R2adj = round(DescTools::PseudoR2(mod,which=c("McFaddenAdj")),4)
  )
}) %>% arrange(desc(AIC))

view selection criteria:

> selection
                             formula   AIC   BIC  R2adj
1        response ~ var1 + I(var2^2) 65.88 71.62 0.0211
2             response ~ var1 + var2 65.26 70.99 0.0304
3      response ~ var1 + var2 + var3 64.69 72.33 0.0389
4             response ~ var1 + var3 63.18 68.91 0.0613
5 response ~ var1 + var2 + I(var2^2) 45.09 52.74 0.3300

Basically I'm wondering- can someone explain to me why these are all different? What should I be using to use one term with a quadratic pattern? Why am I getting such different results?


Solution

  • I get different results to you:

    > selection
                                 formula  AIC   BIC   R2adj
    1 response ~ var1 + var2 + I(var2^2) 40.4 48.05  0.3997
    2      response ~ var1 + var2 + var3 40.4 48.05  0.3997
    3             response ~ var1 + var2 70.5 76.23 -0.0475
    4        response ~ var1 + I(var2^2) 72.6 78.34 -0.0788
    5             response ~ var1 + var3 72.6 78.34 -0.0788
    

    Which makes sense to me. So I don't know what you did. Maybe you changed the data?

    Edit: I'm thinking that you've got a floating var3 vector outside the df which is not the same as the one you think. I mean, it's not var2^2. Creating a data frame in base R is not the same as using third party packages such as dplyr, which allows you to create new variables from other variables "promised" to be created in the data frame. You should probably use the tibble function:

    set.seed(1)
    df <- tibble(response = c(rep(0,times=30), rep(1,times=20)),
                 var1 = runif(50,min=12,max=30),
                 var2 = c(runif(20,min=0,max=25), runif(10,min=30,max=50), runif(20,min=15,max=40)),
                 var3 = var2^2)