r statistics tidyverse linear-regression tidymodels

NULL variable not useable in an R formula


I am able to use tidymodels to build linear regression models, including with NULL explanatory variables. However, when I assign a variable to NULL and use that variable in the formula (until I have a chance to put a new vector in its place), I receive the following error:

Error in model.frame.default(formula = Y ~ X + n, data = data, drop.unused.levels = TRUE) : 
  invalid type (NULL) for variable 'n'

The working demo code is as follows:

library(tidymodels)

data <- tibble(Y = c(1,3), X = c(2,3))
model <- linear_reg() |>
  set_engine("lm") |>
  fit(Y ~ X + NULL, data = data) # works as expected (as if NULL wasn't there)

And the broken code:

data <- tibble(Y = c(1,3), X = c(2,3))
n <- NULL
model <- linear_reg() |>
  set_engine("lm") |>
  fit(Y ~ X + n, data = data) # throws above error

I expected a tidy model output with the relevant p-values and slope coefficients. Instead, I received the error shown above.

I know there are other ways to accomplish what I'm doing (a sort of WalMart brand forward selection), but my undergrad intro to data science course is restricting which libraries we are allowed to use, so I'm stuck with this weird scenario where I need placeholder variables as I iterate over possible combinations. Minimizing non-tidyverse/tidymodels libraries would be ideal but not required. Thanks!


Solution

  • The difference between Y ~ X + NULL and Y ~ X + n after n <- NULL is that the former says you have no second variable, while the latter says you have one. It just happens to have an unusable value.

    Formula-based models depend on how the formula is written, not on the values of the variables it names. For another example, Y ~ X + 0 (which drops the intercept) is very different from n <- 0; Y ~ X + n (which adds a predictor n that is always zero).

    So to get what you want, you need to modify the expression, not the values of the variables in it. One way to do that is to use substitute(), e.g.

    substitute(y ~ x + n, list(n = NULL))
    #> y ~ x + NULL
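
As a rough sketch of how this could fit into an iteration over candidate predictors: substitute() returns an unevaluated call, so the example below wraps it in eval() to get a formula object. The helper make_formula(), the extra column Z, and the toy data are all made up for illustration, and lm() is called directly only to keep the example free of extra packages; the resulting formulas should be usable in the fit() call from the question in the same way.

```r
# Toy data; Z is a hypothetical candidate predictor, not from the question.
data <- data.frame(
  Y = c(1, 3, 4, 7, 8),
  X = c(2, 3, 5, 6, 9),
  Z = c(1, 0, 2, 1, 3)
)

# The formula's text decides which variables are looked up, not their values.
all.vars(Y ~ X + NULL)  # only Y and X are treated as variables
all.vars(Y ~ X + n)     # n is now a variable that must exist and be usable

# Hypothetical helper: `extra` is either NULL (placeholder, no second term)
# or a symbol such as as.name("Z"). substitute() splices it into the
# expression, and eval() turns the resulting call into a formula object.
make_formula <- function(extra = NULL) {
  eval(substitute(Y ~ X + extra, list(extra = extra)))
}

f_placeholder <- make_formula()              # Y ~ X + NULL
f_with_z      <- make_formula(as.name("Z"))  # Y ~ X + Z

# Fit both versions with base R's lm().
fit_placeholder <- lm(f_placeholder, data = data)
fit_with_z      <- lm(f_with_z, data = data)

names(coef(fit_placeholder))
names(coef(fit_with_z))
```

The point of the design is that the placeholder lives in the expression rather than in a variable, so the NULL case and the real-predictor case go through the same code path.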