
How does the wrapper normalizeFeatures behave with a validation set?


I am wondering how the function normalizeFeatures works along with a resampling strategy. Which of these statements is true?

  1. The whole task data is normalized
  2. The training data is normalized, and the parameters of that normalization (say, the mean and standard deviation in a classic standardization) are then used to normalize the validation data (which is roughly what mlrCPO::retrafo does).

Thank you for your help!


Solution

  • The function normalizeFeatures() can be called on a data.frame or a Task object. In both cases it behaves the same: it simply normalizes the whole data set at once, with no train/validation distinction. So statement 1 is true.
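
    A minimal sketch of that behavior (standardizing the built-in iris.task; after the call, each feature has mean 0 and sd 1 across all rows, not just a training split):

    ```r
    library(mlr)

    # normalizeFeatures() standardizes the complete task data at once;
    # there is no train/validation distinction.
    norm.task = normalizeFeatures(iris.task, method = "standardize")

    # each of the four numeric features now has mean 0 over all 150 rows
    colMeans(getTaskData(norm.task)[, 1:4])
    ```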

    If you want the second behavior, you have two options:

    a) preprocWrapperCaret

    The wrapper puts the scaling step in front of both training and prediction. During training, the scaling parameters are estimated and stored; during prediction, the stored parameters are applied to the new data.

    library(mlr)
    lrn = makeLearner("classif.svm")
    lrn = makePreprocWrapperCaret(lrn, ppc.center = TRUE, ppc.scale = TRUE)
    
    set.seed(1)
    res = resample(lrn, iris.task, resampling = hout, models = TRUE)
    
    # the scaling parameters learnt on the training split
    res$models[[1]]$learner.model$control$mean
    
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
           5.831        3.030        3.782        1.222 
    
    res$models[[1]]$learner.model$control$std
    
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       0.8611356    0.4118203    1.7487877    0.7710127 
    

    b) mlrCPO

    A slightly more elegant and flexible approach is to build a preprocessing pipeline with the mlrCPO package, which has the same effect as the wrapper in this case.

    library(mlr)
    library(mlrCPO)
    lrn = cpoScale(center = TRUE, scale = TRUE) %>>% makeLearner("classif.svm")
    set.seed(1)
    res = resample(lrn, iris.task, resampling = hout, models = TRUE)
    # the scaling parameters learnt on the training split
    res$models[[1]]$learner.model$retrafo$element$state
    
    $center
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
           5.831        3.030        3.782        1.222 
    
    $scale
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       0.8611356    0.4118203    1.7487877    0.7710127 
    

    I set the seed so that both cases use the same training split; therefore the learnt scaling parameters are identical for both approaches.
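
    For completeness, the retrafo mechanism mentioned in the question can also be used directly on data, outside of resample(): applying a CPO to training data attaches a retrafo object, which then re-applies the training parameters to new data. A small sketch (the 100/50 row split of iris is just for illustration):

    ```r
    library(mlrCPO)

    train = iris[1:100, 1:4]
    test  = iris[101:150, 1:4]

    # applying the CPO to the training data learns center/scale there
    scaled.train = train %>>% cpoScale(center = TRUE, scale = TRUE)

    # the retrafo stores the training means and sds ...
    rt = retrafo(scaled.train)

    # ... and re-applies them to the test data without re-estimating them,
    # so the test columns are generally not centered at 0
    scaled.test = test %>>% rt
    ```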