
Preprocessing data with the R `recipes` package: how to impute by mode in numeric columns (to fit a model with xgboost)?


I want to use xgboost for a classification problem, and two predictors (out of several) are binary columns that also happen to have some missing values. Before fitting a model with xgboost, I want to replace those missing values by imputing the mode in each binary column.

My problem is that I want to do this imputation as part of a tidymodels "recipe", that is, not with typical data-wrangling tools such as dplyr/tidyr/data.table. Doing the imputation within a recipe should guard against "information leakage".
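
(To be explicit about what I mean by leakage: the imputation value should be learned from the training rows only and then applied to both training and test rows. A rough base-R sketch of that idea, purely for illustration, where train and test are placeholder data frames with a binary x1 column:)

# mode learned from the training rows only (table() ignores NAs by default)
get_mode <- function(x) as.numeric(names(which.max(table(x))))
m <- get_mode(train$x1)

# ...and then applied wherever x1 is missing, in training and test data alike
train$x1[is.na(train$x1)] <- m
test$x1[is.na(test$x1)]   <- m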

Although the recipes package provides many step_*() functions designed for data preprocessing, I could not find a way to do the desired imputation by mode on numeric binary columns. There is a step_impute_mode() function, but it accepts only nominal variables (i.e., of class factor or character), whereas I need my binary columns to remain numeric so they can be passed to the xgboost engine.

Consider the following toy example. I took it from this reference page and changed the data a bit to reflect the problem.

create toy data

# install.packages("xgboost")
library(tidymodels)
tidymodels_prefer()

# original data shipped with package
data(two_class_dat)

# simulating 2-column binary data + NAs
n_rows <- nrow(two_class_dat)

df_x1_x2 <-
  data.frame(x1 = rbinom(n_rows, 1, runif(1)),
             x2 = rbinom(n_rows, 1, runif(1)))

## randomly replace 25% of each column with NAs
df_x1_x2[c("x1", "x2")] <-
  lapply(df_x1_x2[c("x1", "x2")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })

# bind original data & simulated data
df_to_xgboost <- cbind(two_class_dat, df_x1_x2)

# split data to training and testing
data_train <- df_to_xgboost[-(1:10), ]
data_test  <- df_to_xgboost[  1:10 , ]
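
(Just to demonstrate the constraint I described above: selecting the numeric columns directly in step_impute_mode() is expected to fail once the recipe is prepped, because the step only accepts factor/character columns.)

# expected to error during prep(), since x1/x2 are numeric rather than nominal
recipe(Class ~ ., data = data_train) %>%
  step_impute_mode(x1, x2) %>%
  prep()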

set up model specification & preprocessing recipe using tidymodels tools

# model specification
xgb_spec <- 
  boost_tree(trees = 15) %>% 
  # This model can be used for classification or regression, so set mode
  set_mode("classification") %>% 
  set_engine("xgboost")

# preprocessing recipe
xgb_recipe <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_bin2factor(x1, x2) %>% # <-- these 2 lines are the heart of the problem:
  step_impute_mode(x1, x2)    # <-- I can't impute unless I first convert the columns from numeric to factor/character,
#                                   but once I do, xgboost fails on non-numeric data,
#                                   and there is no `step_*()` for converting back to numeric (like as.numeric()).


# bind `xgb_spec` and `xgb_recipe` into a workflow object
xgb_wflow <-
  workflow() %>%
  add_recipe(xgb_recipe) %>% 
  add_model(xgb_spec)

fit the model

fit(xgb_wflow, data_train)
#> Error in xgboost::xgb.DMatrix(x, label = y, missing = NA): 'data' has class 'character' and length 3124.
#>   'data' accepts either a numeric matrix or a single filename.
#> Timing stopped at: 0 0 0

The fitting fails because data_train$x1 and data_train$x2 are turned into factors by step_bin2factor(x1, x2). So that's my catch-22: on the one hand, I can't fit an xgboost model unless all the data is numeric; on the other hand, I can't impute by mode unless the data is factor/character.

Although there is a way to build custom step_*() functions, it's a bit involved, so I first wanted to reach out and see whether there's a simple solution I'm missing. My situation with xgboost and binary predictors seems pretty mainstream, and I don't want to reinvent the wheel.


Solution

  • Credit to user @gus who answered here:

    xgb_recipe <-
      recipe(formula = Class ~ ., data = data_train) %>%
      step_num2factor(c(x1, x2),
                      transform = function(x) x + 1,
                      levels = c("0", "1")) %>%
      step_impute_mode(x1, x2) %>%
      step_mutate_at(c(x1, x2), fn = ~ as.numeric(.) - 1)
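
    How I read the trick (my gloss, not part of the original answer): step_num2factor() uses the transformed numeric values as indices into levels, so transform = function(x) x + 1 maps 0/1 onto the first/second level ("0"/"1"); after the mode imputation, as.numeric() on a factor returns the level code (1 or 2), so the - 1 in step_mutate_at() restores the original 0/1 coding. A rough base-R equivalent of the round trip:

    x <- c(0, 1, NA, 1)
    f <- factor(c("0", "1")[x + 1], levels = c("0", "1"))  # roughly what step_num2factor() does
    as.numeric(f) - 1  # level codes 1/2 minus 1 -> back to 0/1 (the NA is what step_impute_mode() fills in)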
    

    The entire code

    # install.packages("xgboost")
    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from   
    #>   required_pkgs.model_spec parsnip
    tidymodels_prefer()
    
    data(two_class_dat)
    
    n_rows <- nrow(two_class_dat)
    
    df_x1_x2 <-
      data.frame(x1 = rbinom(n_rows, 1, runif(1)),
                 x2 = rbinom(n_rows, 1, runif(1)))
    
    df_x1_x2[c("x1", "x2")] <-
      lapply(df_x1_x2[c("x1", "x2")], function(x) {
        x[sample(seq_along(x), 0.25 * length(x))] <- NA
        x
      })
    
    df_to_xgboost <- cbind(two_class_dat, df_x1_x2)
    ### 
    data_train <- df_to_xgboost[-(1:10), ]
    data_test  <- df_to_xgboost[  1:10 , ]
    
    xgb_spec <- 
      boost_tree(trees = 15) %>% 
      set_mode("classification") %>% 
      set_engine("xgboost")
    
    xgb_recipe <-
      recipe(formula = Class ~ ., data = data_train) %>%
      step_num2factor(c(x1, x2),
                      transform = function(x) x + 1,
                      levels = c("0", "1")) %>%
      step_impute_mode(x1, x2) %>%
      step_mutate_at(c(x1, x2), fn = ~ as.numeric(.) - 1)
    
    xgb_recipe %>% prep() %>% bake(new_data = NULL)
    #> # A tibble: 781 x 5
    #>        A     B    x1    x2 Class 
    #>    <dbl> <dbl> <dbl> <dbl> <fct> 
    #>  1 1.44  1.68      1     1 Class1
    #>  2 2.34  2.32      1     1 Class2
    #>  3 2.65  1.88      0     1 Class2
    #>  4 0.849 0.813     1     1 Class1
    #>  5 3.25  0.869     1     1 Class1
    #>  6 1.05  0.845     0     1 Class1
    #>  7 0.886 0.489     1     0 Class1
    #>  8 2.91  1.54      1     1 Class1
    #>  9 3.14  2.06      1     1 Class2
    #> 10 1.04  0.886     1     1 Class2
    #> # ... with 771 more rows
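
    # (not part of the original answer) as a leakage check, the recipe prepped on
    # the training set can also be baked on the held-out test rows; the x1/x2
    # modes are estimated from data_train only
    xgb_recipe %>% prep(training = data_train) %>% bake(new_data = data_test)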
    
    xgb_wflow <-
      workflow() %>%
      add_recipe(xgb_recipe) %>% 
      add_model(xgb_spec)
    
    fit(xgb_wflow, data_train)
    #> [09:35:36] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
    #> == Workflow [trained] ==========================================================
    #> Preprocessor: Recipe
    #> Model: boost_tree()
    #> 
    #> -- Preprocessor ----------------------------------------------------------------
    #> 3 Recipe Steps
    #> 
    #> * step_num2factor()
    #> * step_impute_mode()
    #> * step_mutate_at()
    #> 
    #> -- Model -----------------------------------------------------------------------
    #> ##### xgb.Booster
    #> raw: 59.4 Kb 
    #> call:
    #>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
    #>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    #>     subsample = 1, objective = "binary:logistic"), data = x$data, 
    #>     nrounds = 15, watchlist = x$watchlist, verbose = 0, nthread = 1)
    #> params (as set within xgb.train):
    #>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", objective = "binary:logistic", nthread = "1", validate_parameters = "TRUE"
    #> xgb.attributes:
    #>   niter
    #> callbacks:
    #>   cb.evaluation.log()
    #> # of features: 4 
    #> niter: 15
    #> nfeatures : 4 
    #> evaluation_log:
    #>     iter training_logloss
    #>        1         0.551974
    #>        2         0.472546
    #> ---                      
    #>       14         0.251547
    #>       15         0.245090
    

    Created on 2021-12-25 by the reprex package (v2.0.1.9000)