
In building a predictive model in R, does using step_normalize function (tidymodels) require test data to be processed in the same way?


I'm sorry if this is the wrong place to ask this question. I'm confused about part of the tidymodels functionality. If I have a dataset (ion_train in the example below) and I apply step_normalize to all predictors in the recipe used to build an SVM, that will normalise the numeric predictors to have a mean of zero and a standard deviation of one. Does that mean that when I apply my SVM to a test dataset (ion_test below), I first need to scale that test dataset to have a mean of zero and a standard deviation of one before I use the predict() function (at the bottom of the code below)?

library(tidymodels)
library(mlbench)
data(Ionosphere)

# preprocess dataset
Ionosphere <- Ionosphere %>% select(-V1, -V2)

# split into training and test data
ion_split <- initial_split(Ionosphere, prop = 3/5)

ion_train <- training(ion_split)
ion_test <- testing(ion_split) 

# make a recipe
iono_rec <-
  recipe(Class ~ ., data = ion_train)  %>%
  step_normalize(all_predictors()) 

# build the model and workflow
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_workflow <- 
      workflow() %>%
      add_recipe(iono_rec) %>%
      add_model(svm_mod)

# run model tuning
set.seed(35)
recipe_res <-
  svm_workflow %>% 
  tune_grid(
    resamples = bootstraps(ion_train, times = 2),
    metrics = metric_set(roc_auc),
    control = control_grid(verbose = TRUE, save_pred = TRUE)
  )

# choose best model, finalise workflow
best_mod <- recipe_res %>% select_best(metric = "roc_auc")
final_wf <- finalize_workflow(svm_workflow, best_mod)
final_mod <- final_wf %>% fit(ion_train)

predict_res <- predict(
        final_mod,
        ion_test,
        type = "prob")

Solution

  • You might want to read through this chapter on using recipes for data preprocessing and feature engineering. The idea of a recipe is that you estimate statistics from the training set and then apply that same preprocessing to any other data, like the testing set or new data at prediction time.

    Let's walk through that.

    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from   
    #>   required_pkgs.model_spec parsnip
    library(mlbench)
    data(Ionosphere)
    
    # preprocess dataset
    Ionosphere <- Ionosphere %>% select(-V1, -V2)
    
    # split into training and test data
    ion_split <- initial_split(Ionosphere, prop = 3/5)
    
    ion_train <- training(ion_split)
    ion_test <- testing(ion_split) 
    
    # make a recipe
    iono_rec <-
      recipe(Class ~ ., data = ion_train)  %>%
      step_normalize(all_predictors()) 
    

    The function prep() is what calculates/estimates statistics from the training set. You can get the result out using bake(); when you use new_data = NULL, you get the result for the training data, i.e. the original data used to estimate (in this case) the mean and standard deviation.

    iono_rec %>%
      prep() %>%
      bake(new_data = NULL)
    #> # A tibble: 211 x 33
    #>        V3      V4     V5     V6     V7     V8     V9    V10     V11     V12
    #>     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
    #>  1  0.698 -0.163   0.452 -0.227  0.527 -0.912  0.908 -0.298  0.587  -0.700 
    #>  2  0.708 -0.477   0.604 -1.10  -1.41  -2.00   0.908 -0.477 -0.0234 -1.74  
    #>  3  0.708 -0.102   0.739 -0.268  0.869 -0.414  0.676 -0.353  0.371  -0.221 
    #>  4  0.708 -1.12    0.739  1.99   0.276 -2.12  -1.20  -0.379 -0.927  -0.332 
    #>  5 -1.41  -0.0342 -1.40  -0.551 -1.21  -0.410 -0.890 -0.236 -0.859  -0.463 
    #>  6  0.656 -0.277   0.634 -0.753  0.721 -0.730  0.613 -0.968  0.489  -1.33  
    #>  7 -1.46  -0.0198 -1.21  -0.279  0.869 -2.12  -1.20  -0.379 -2.70   -2.41  
    #>  8  0.708 -1.34    0.739 -2.55   0.869 -2.12   0.908  0.401  0.849  -1.18  
    #>  9  0.708  0.159   0.739 -0.202  0.869 -0.288  0.908 -0.190  0.849   0.0754
    #> 10 -0.356 -2.30    0.739  0.328 -1.26  -2.12   0.908 -2.53  -0.151  -2.41  
    #> # … with 201 more rows, and 23 more variables: V13 <dbl>, V14 <dbl>, V15 <dbl>,
    #> #   V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>,
    #> #   V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>,
    #> #   V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>, V33 <dbl>,
    #> #   V34 <dbl>, Class <fct>
    

    You can bake() other data as well, such as the testing set or new data. This applies the mean and standard deviation estimated from the training set (here via step_normalize()) to those other data sets. This is deliberate, to protect against data leakage: you always estimate statistics from the training set and then apply them to any other data, like new data or testing data.

    iono_rec %>%
      prep() %>%
      bake(new_data = ion_test)
    #> # A tibble: 140 x 33
    #>        V3      V4     V5     V6     V7     V8     V9    V10     V11    V12
    #>     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
    #>  1  0.708 -0.0781  0.625 -0.131  0.706 -0.632  0.427 -0.732  0.0107 -0.753
    #>  2  0.629 -0.195   0.739 -0.605  0.869 -0.594  0.908 -1.16   0.717  -1.24 
    #>  3 -1.50  -0.225  -1.21  -0.279 -1.19  -0.180 -0.958 -0.956 -1.74   -1.12 
    #>  4  0.708  0.142   0.739 -0.698  0.869 -0.710  0.908 -1.31   0.849  -1.19 
    #>  5  0.708 -0.416   0.739 -0.511  0.869 -0.475  0.908 -0.794  0.743  -1.06 
    #>  6  0.708 -2.13    0.739  0.227  0.570 -0.955  0.908 -0.639  0.849   0.397
    #>  7 -1.46  -0.0198 -3.16  -2.55   0.869  1.76  -3.31   1.77  -2.70    1.74 
    #>  8  0.708  2.41   -1.21  -0.279 -1.19  -0.180 -3.31  -2.53  -0.927  -0.332
    #>  9  0.708 -0.231   0.739 -0.672  0.594 -1.77   0.799  0.936  0.768  -1.19 
    #> 10  0.708  0.184   0.739  0.116  0.869 -0.438  0.870  1.01   0.849   0.661
    #> # … with 130 more rows, and 23 more variables: V13 <dbl>, V14 <dbl>, V15 <dbl>,
    #> #   V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>,
    #> #   V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>,
    #> #   V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>, V33 <dbl>,
    #> #   V34 <dbl>, Class <fct>
    
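    To see this concretely, you can check that bake() normalises the test set with the training set's mean and standard deviation, not the test set's own. A minimal sketch using one predictor (the seed is arbitrary):

    ```r
    library(tidymodels)
    library(mlbench)
    data(Ionosphere)

    Ionosphere <- Ionosphere %>% select(-V1, -V2)
    set.seed(123)
    ion_split <- initial_split(Ionosphere, prop = 3/5)
    ion_train <- training(ion_split)
    ion_test  <- testing(ion_split)

    rec_prepped <- recipe(Class ~ ., data = ion_train) %>%
      step_normalize(all_predictors()) %>%
      prep()

    # normalise V3 of the *test* set by hand, using *training* statistics
    train_mean <- mean(ion_train$V3)
    train_sd   <- sd(ion_train$V3)
    manual_v3  <- (ion_test$V3 - train_mean) / train_sd

    # bake() produces the same values
    baked_v3 <- bake(rec_prepped, new_data = ion_test)$V3
    all.equal(manual_v3, baked_v3)
    #> [1] TRUE
    ```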

    Created on 2021-04-14 by the reprex package (v2.0.0)

    Now, your example showed putting your recipe into a workflow. When you use a workflow(), high-level functions handle these phases automatically: fit() preps the recipe on the training data, and predict() on the fitted workflow bakes the new data before predicting, so you do not have to call prep() and bake() manually. You can read more about the details of using a model workflow() here.
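    If you want the fit-on-training/evaluate-on-test step handled for you as well, tidymodels provides last_fit(), which takes a finalised workflow and the original split. This is a sketch assuming the final_wf and ion_split objects from the code above:

    ```r
    # fit the finalised workflow on the training portion of the split and
    # evaluate on the test portion; the recipe is prepped on ion_train and
    # baked on ion_test behind the scenes
    final_res <- last_fit(final_wf, split = ion_split)

    collect_metrics(final_res)      # test-set metrics (accuracy, ROC AUC)
    collect_predictions(final_res)  # per-row predictions on ion_test
    ```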