Tags: r, logistic-regression, glm, spline, bspline

Avoid writing large number of column names in a model formula with bs() terms


I want to use the bs() function for the numeric variables in my dataset when fitting a logistic regression model.

df <- data.frame(a = c(0, 1), b = c(0, 1), d = c(0, 1), e = c(0, 1),
                 f = c("m", "f"), output = c(0, 1))

library(splines)
model <- glm(output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e, df = 2) +
               factor(f),
             data = df,
             family = "binomial")

In my actual dataset, I need to apply bs() to many more columns than in this example. Is there a way to do this without writing out every term?


Solution

  • We can build the spline terms as strings with sprintf, then assemble the formula with reformulate:

    predictors <- c("a", "b", "d", "e")
    bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
    other.terms <- "factor(f)"
    form <- reformulate(c(bspl.terms, other.terms), response = "output")
    #output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e, 
    #    df = 2) + factor(f)
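
    The resulting formula object can be passed to glm() like any hand-written formula. Below is a minimal sketch, assuming a simulated data frame `sim` (hypothetical, not the OP's data) with enough rows for the spline terms to be estimable. It uses df = 3 rather than 2, an assumption made here so that df matches the default cubic degree of bs() and no "df was too small" warning is raised.

    ```r
    library(splines)

    # Simulated data, purely illustrative (not the OP's dataset)
    set.seed(1)
    n <- 200
    sim <- data.frame(a = rnorm(n), b = rnorm(n), d = rnorm(n), e = rnorm(n),
                      f = sample(c("m", "f"), n, replace = TRUE))
    sim$output <- rbinom(n, 1, plogis(0.5 * sim$a - 0.3 * sim$b))

    # Build the formula programmatically, as in the answer
    predictors <- c("a", "b", "d", "e")
    bspl.terms <- sprintf("bs(%s, df = 3)", predictors)
    form <- reformulate(c(bspl.terms, "factor(f)"), response = "output")

    # The formula object is used exactly like a hand-written one
    model <- glm(form, data = sim, family = "binomial")

    # Intercept + 4 spline terms with 3 basis columns each + 1 factor contrast
    length(coef(model))
    ```

    Because sprintf is vectorized over `predictors`, adding more columns only means extending that character vector; the glm() call stays the same.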
    

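    Using a different df and degree for each spline is just as straightforward, because sprintf is vectorized over all of its arguments. Note that df cannot be smaller than degree: with the default intercept = FALSE, bs() issues a warning and raises df to degree if it is.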

    predictors <- c("a", "b", "d", "e")
    dof <- c(3, 4, 3, 6)
    degree <- c(2, 2, 2, 3)
    bspl.terms <- sprintf("bs(%s, df = %d, degree = %d)", predictors, dof, degree)
    other.terms <- "factor(f)"
    form <- reformulate(c(bspl.terms, other.terms), response = "output")
    #output ~ bs(a, df = 3, degree = 2) + bs(b, df = 4, degree = 2) + 
    #    bs(d, df = 3, degree = 2) + bs(e, df = 6, degree = 3) + factor(f)
    

    Prof. Ben Bolker: I was going to do something a little bit fancier, something like predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output").

    Yes, this is safer, since it guarantees that only numeric columns are passed to bs(). It is also the automatic way to go if the OP wants to include every numeric variable other than "output" as a predictor.
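
    Putting the comment's suggestion together with reformulate, one possible sketch looks like this. It reuses the example data frame from the question; vapply is swapped in for sapply only so the result is guaranteed to be a logical vector, the name `is.num` is introduced here for illustration, and df = 3 is an assumption chosen to match the default cubic degree.

    ```r
    # Example data frame from the question
    df <- data.frame(a = c(0, 1), b = c(0, 1), d = c(0, 1), e = c(0, 1),
                     f = c("m", "f"), output = c(0, 1))

    # All numeric columns except the response become spline predictors
    is.num <- vapply(df, is.numeric, logical(1))
    predictors <- setdiff(names(df)[is.num], "output")

    # Same string-building step as before, now driven by the column types
    bspl.terms <- sprintf("bs(%s, df = 3)", predictors)
    form <- reformulate(c(bspl.terms, "factor(f)"), response = "output")

    predictors  # names of the columns that will get a bs() term
    form        # the assembled model formula
    ```

    Non-numeric columns such as f are skipped automatically, so they still have to be added by hand (here as "factor(f)").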