Search code examples
rregressionfactors

Automatically exclude single-level factor variables from regression


I have a custom function which, among other things, creates a regression formula based on specified string inputs and runs a regression (brm, but should work similarly for basic lm):

model_predict <- function(.data, dep_var, model ...) {
    form <- as.formula(str_glue("{dep_var} ~ {model}"))
    form_vars <- all.vars(form)

    ... # some other stuff

    fit <- brm(form, .data, ...)

    ... # some other stuff
}

I use this to fit a large number of pre-specified models as part of a larger workflow.

Sometimes, some of the variables in model are factor variables, and sometimes those factors have only one level in the data. This results in the contrasts can be applied only to factors with 2 or more levels error when trying to fit the model.

Because of the larger workflow and because it's not always clear a priori if the data and model for any given iteration is going to encounter this problem, I'd rather not manually remove the factor vars from the model specified when those vars only have one level in the relevant data subset.

Is there is some setting on lm or brm that can be used to tell the model fitting process itself to ignore single-level factor vars? that would be the easiest solution, but I'm not sure it exists.

Alternatively, I'd like an automated solution that identifies when the single factor level situation arises, and drops the relevant vars from the formula when it does (maybe giving a warning message too), for instance:

# main formula
form
> outcome ~ predictor_1 + predictor_2 * interactor

# Desired outputs...
# if predictor_1 has only one level in data
> outcome ~ predictor_2 * interactor

# if predictor_2 has only one level in data
> outcome ~ predictor_1

# if interactor has only one level in data
> outcome ~ predictor_1 + predictor_2

I've tried tryCatch as suggested here but while that suppresses the contrasts... error, it returns NULL instead of a fitted model ignoring the offending vars, which is what I need.

Additionally, sometimes those variables are in the formula with + and sometimes with * as interaction effects, which makes dynamically building the formula difficult.


Solution

  • Something like this should work:

    # For pipe operator
    library(magrittr)
    
    # Define function for updating the model
    update_formula <- function(data, dep_var, model) {
      # Extract model variables
      model_vars <- stringr::str_split_1(model, '\\+|\\*|:') %>% 
        stringr::str_trim()
      
      model_terms <- stringr::str_split_1(model, '\\+') %>% 
        stringr::str_trim()
      
      # Get the number of levels for all factor variables
      lev_leng <- .data %>% 
        dplyr::select(where(is.factor) & any_of(model_vars)) %>% 
        purrr::map_int(
          ~ length(levels(droplevels(.x)))
        )
      
      # Check if length is one
      invalid <- names(which(lev_leng == 1))
      
      # If any is invalid, it will have length > 0
      if (length(invalid) != 0) {
        for (i in invalid) {
          if (any(stringr::str_detect(model_terms, paste0('\\*[:space:]?', i)))) {
            # Remove interactor from formula
            model_terms <- stringr::str_remove(
              model_terms,
              paste0('[\\*,:]?[:space:]?',i,'[:space:]?[\\*,:]?')
            )
          } else{
            # Remove entire term from formula
            model_terms <- model_terms[stringr::str_detect(model_terms, i, T)]
          }
        }
        model <- stringr::str_flatten(model_terms, '+')
      }
      # Define the new formula
      as.formula(glue::glue('{dep_var} ~ {model}'))
    }
    
    
    # Dep var defition
    dep_var <- 'y'
    # Model example
    model <- "predictor_1 + predictor_2 * interactor"
    
    # Example predictor_1
    .data <- tibble::tibble(
      y = runif(10),
      predictor_1 = factor(1),
      predictor_2 = factor(letters[1:10]),
      interactor  = factor(letters[1:10])
    )
    
    # Update model
    form <- update_formula(.data, dep_var, model)
    form
    #> y ~ predictor_2 * interactor
    #> <environment: 0x0000015b95006a58>
    
    # Fit the model
    fit <- lm(form, .data)
    
    # Example predictor_2
    .data <- tibble::tibble(
      y = runif(10),
      predictor_1 = runif(10),
      predictor_2 = factor(letters[rep(1, 10)]),
      interactor  = factor(letters[1:10])
    )
    
    form <- update_formula(.data, dep_var, model)
    form
    #> y ~ predictor_1
    #> <environment: 0x0000015b95cdd218>
    
    # Example interactor
    .data <- tibble::tibble(
      y = runif(10),
      predictor_1 = runif(10),
      predictor_2 = factor(letters[1:10]),
      interactor  = factor(letters[rep(1, 10)])
    )
    
    form <- update_formula(.data, dep_var, model)
    form
    #> y ~ predictor_1 + predictor_2
    #> <environment: 0x0000015b96072878>
    
    # Example predictor_1 & interactor
    .data <- tibble::tibble(
      y = runif(10),
      predictor_1 = factor(10),
      predictor_2 = factor(letters[1:10]),
      interactor  = factor(letters[rep(1, 10)])
    )
    
    form <- update_formula(.data, dep_var, model)
    form
    #> y ~ predictor_2
    #> <environment: 0x0000015b9669d448>
    

    Created on 2023-06-13 with reprex v2.0.2