In R tidymodels how can I specify contrasts for specific variables?


I would like to specify "sum to zero" contrasts for two predictors in a linear model using a tidymodels recipe. Is that possible? Looking at the recipes documentation, it seems that before 1.3 there were attempts to build variable-specific options, but the strategy then shifted to a global option.

I am trying to convert this base R code into tidymodels:

Bikeshare <- ISLR2::Bikeshare  # start with original data
contrasts(Bikeshare$hr) <- contr.sum(24)
contrasts(Bikeshare$mnth) <- contr.sum(12)
mod.lm2 <-
  lm(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare
  )
summary(mod.lm2)
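
For readers unfamiliar with the coding, here is a minimal sketch of what `contr.sum()` produces and how a contrast set on one factor shows up in the design matrix. The `toy` data frame and its columns are made up for illustration; they are not the Bikeshare data.

```r
# Sum-to-zero coding for a 3-level factor: each column sums to zero,
# and the last level is coded -1 in every column.
cs <- contr.sum(3)
cs
colSums(cs)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "b", "c", "a", "b", "c")),
  h = factor(c("u", "u", "u", "v", "v", "v"))
)

# Attach sum-to-zero contrasts to g only; h keeps the session default
contrasts(toy$g) <- contr.sum(3)

# The design matrix that lm() would build from this formula
mm <- model.matrix(y ~ g + h, data = toy)
colnames(mm)
mm
```

The columns for `g` are numbered (`g1`, `g2`) because `contr.sum()` does not name its columns, which is why the base R fit reports coefficients such as `mnth1` and `hr1`.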

I got this far:

library(tidymodels)
Bikeshare <- ISLR2::Bikeshare  # start with original data
contrasts(Bikeshare$hr) <- contr.sum(24)
contrasts(Bikeshare$mnth) <- contr.sum(12)

lm_spec <- linear_reg() %>%
  set_engine("lm")

the_rec <- 
  recipe(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare
  ) %>%
  step_dummy(c(mnth, hr), one_hot = TRUE)

the_workflow <- workflow() %>%
  add_recipe(the_rec) %>% 
  add_model(lm_spec)

the_workflow_fit_lm_fit <- 
  fit(the_workflow, data = Bikeshare) %>% 
  extract_fit_parsnip()

summary(the_workflow_fit_lm_fit$fit)

Does anybody know how to get the same results out of a tidymodels workflow?

I don't think I can use contr.sum as a global option. It gives me the coefficients I want for the two variables, but it also changes the contrasts for the other factors.

Bikeshare <- ISLR2::Bikeshare  # be sure to work with original data
old_opt <- getOption("contrasts")
options(contrasts = c('contr.sum', 'contr.poly'))
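
The side effect described above can be reproduced with base R alone. This is a minimal sketch on a made-up `toy` data frame (not the Bikeshare data), using `model.matrix()` rather than a recipe: under the global option every unordered factor is sum-coded, whereas the `contrasts.arg` argument of `model.matrix()` restricts the coding to the named factor.

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(
  g = factor(c("a", "b", "c")),
  h = factor(c("u", "v", "u"))
)

# options() returns the previous values, so they can be restored afterwards
old_opt <- options(contrasts = c("contr.sum", "contr.poly"))
mm_global <- model.matrix(~ g + h, data = toy)
options(old_opt)  # restore the previous setting

# Under the global option, h is sum-coded as well (1 / -1 instead of 0 / 1)
colnames(mm_global)
mm_global[, "h1"]

# Per-variable alternative in base R: only g is sum-coded, h keeps the default
mm_local <- model.matrix(~ g + h, data = toy,
                         contrasts.arg = list(g = "contr.sum"))
colnames(mm_local)
```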

Solution

  • The docs for step_dummy() say:

    To change the type of contrast being used, change the global contrast option via options.

    So, outside of the global options, there is no way to change it.

    We should probably have an example though :-/

    Note that, for new samples, the contrasts are read from the global option again. Make sure it is set the same way at prediction time:

    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from   
    #>   required_pkgs.model_spec parsnip
    tidymodels_prefer()
    
    data("penguins")
    
    penguins <- 
      penguins %>% 
      distinct(species)
    
    # R's defaults
    old_opt <- options()$contrasts
    old_opt
    #>         unordered           ordered 
    #> "contr.treatment"      "contr.poly"
    
    # default contrast
    default <- 
      recipe(~ species, data = penguins) %>% 
      step_dummy(species) %>% 
      prep()
    
    default %>% bake(new_data = NULL)
    #> # A tibble: 3 × 2
    #>   species_Chinstrap species_Gentoo
    #>               <dbl>          <dbl>
    #> 1                 0              0
    #> 2                 0              1
    #> 3                 1              0
    
    # Now set the global option to something else:
    options(contrasts = c('contr.sum', 'contr.poly'))
    
    with_opt <- 
      recipe(~ species, data = penguins) %>% 
      step_dummy(species) %>% 
      prep()
    
    with_opt %>% bake(new_data = NULL)
    #> # A tibble: 3 × 2
    #>   species_X1 species_X2
    #>        <dbl>      <dbl>
    #> 1          1          0
    #> 2         -1         -1
    #> 3          0          1
    
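    # Reset the options and bake again: the dummy columns revert to the
    # default treatment coding, even though the recipe was prepped under contr.sum.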
    
    options(contrasts = old_opt)
    with_opt %>% bake(new_data = penguins)
    #> # A tibble: 3 × 2
    #>   species_Chinstrap species_Gentoo
    #>               <dbl>          <dbl>
    #> 1                 0              0
    #> 2                 0              1
    #> 3                 1              0
    

    Created on 2021-11-16 by the reprex package (v2.0.0)
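
One way to keep the option consistent is to scope it to the calls that need it. Below is a minimal base R sketch; `with_sum_contrasts()` is a hypothetical helper written for this answer, not part of recipes or tidymodels.

```r
# Hypothetical helper: evaluate `code` with sum-to-zero contrasts as the
# global option, then restore whatever was set before, even on error.
with_sum_contrasts <- function(code) {
  old_opt <- options(contrasts = c("contr.sum", "contr.poly"))
  on.exit(options(old_opt), add = TRUE)
  force(code)  # `code` is a lazy promise, so it is evaluated here
}

before <- getOption("contrasts")
inside <- with_sum_contrasts(getOption("contrasts"))
after  <- getOption("contrasts")

inside                    # the option as seen by the wrapped code
identical(before, after)  # the session setting is left untouched
```

Assuming the recipe from the reprex above, the same wrapper would go around both steps, for example `with_sum_contrasts(prep(rec))` and later `with_sum_contrasts(bake(prepped, new_data = new_samples))`, where `rec`, `prepped` and `new_samples` are placeholder names. `withr::with_options()` offers the same pattern if you prefer not to write the helper yourself.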