
R caret: preprocess only some of the dataset's variables, then train the model


I have a training set with some dummy variables [0]. I do not want to apply preProc=c("center","scale") to them, but I do want to apply it to all the non-dummy variables in order to normalize them, as described here [1]. The center and scale options do the following:

  • center: subtract mean from values.
  • scale: divide values by standard deviation.
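As a minimal illustration of those two operations in base R (the vector x is made up for the example and is not from the question):

```r
x <- c(2, 4, 6)

centered <- x - mean(x)           # center: subtract the mean (4)
standardized <- centered / sd(x)  # scale: divide by the standard deviation (2)

print(standardized)               # the values are -1, 0, 1
```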

Would it make sense to build an array with all the non-dummy variables, calculate the mean and SD of each variable, center and scale all of the values, and then concatenate this array with another array that contains all the dummy variables? The result would be a new_array array, and I would then train the model like this. Or would this not work?

library(caret)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_model <- train(Class ~ ., data = new_array, method = "knn", trControl = ctrl)
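For reference, the manual approach described above could be sketched in base R roughly as follows. This is only an illustration under assumed names: the toy data frame df and its columns height, weight and is_member are hypothetical and not from the question.

```r
# Toy data, made up for illustration: two numeric columns, one 0/1 dummy, one outcome
df <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(50, 60, 70, 80),
  is_member = c(0, 1, 0, 1),
  Class = factor(c("a", "b", "a", "b"))
)

numeric_cols <- c("height", "weight")  # columns to center and scale
dummy_cols <- c("is_member")           # columns to leave untouched

# scale() subtracts each column's mean and divides by its standard deviation
scaled <- scale(df[, numeric_cols])

# Recombine: standardized numerics, untouched dummies, and the outcome
new_array <- cbind(
  as.data.frame(scaled),
  df[, dummy_cols, drop = FALSE],
  Class = df$Class
)

# Keep the training means and SDs so the same shift and scale
# can later be applied to new observations
train_means <- attr(scaled, "scaled:center")
train_sds <- attr(scaled, "scaled:scale")
```

The resulting new_array has standardized numeric columns next to the unchanged dummy column, so it can be passed to train() as in the call above.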

Note: I have already asked this question on Cross Validated, but since it is also relevant to Stack Overflow, I am asking it again here.

[0] https://topepo.github.io/caret/pre-processing.html#dummy

[1] Dummy variables and preProcess


Solution

  • You could do this to keep everything within caret

    Let's say you have a data.frame called DF whose columns 1:5 are numeric and whose columns 6:10 are factors. You could do the following:

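    # Learn the centering and scaling parameters from the numeric columns only
    PreProcovCenter <- preProcess(DF[, 1:5], method = c("center", "scale"))
    # dummyVars() needs a formula; "~ ." covers every column of the data passed in
    PreProcovDummy <- dummyVars(~ ., data = DF[, 6:10])
    
    DF[, 1:5] <- predict(PreProcovCenter, DF[, 1:5])
    DFDummy <- predict(PreProcovDummy, DF[, 6:10])
    
    # Replace the factor columns with their dummy columns; all other columns are kept
    DF <- cbind(DF[, -(6:10)], DFDummy)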
    

    and finally:

    knn_model <- train(Class ~ ., data = DF, method = "knn", trControl = ctrl)
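
Putting the pieces together, an end-to-end version might look like the sketch below. It assumes the caret package is installed. The toy data and the names x1, x2, colour, num_cols, pp, dv and train_df are invented for illustration and do not come from the question.

```r
library(caret)

set.seed(1)
# Toy data, made up for illustration: two numeric predictors, one factor, one outcome
DF <- data.frame(
  x1 = rnorm(60, mean = 10, sd = 3),
  x2 = rnorm(60, mean = 100, sd = 20),
  colour = factor(sample(c("red", "green", "blue"), 60, replace = TRUE)),
  Class = factor(rep(c("yes", "no"), 30))
)

num_cols <- c("x1", "x2")

# Fit centering and scaling on the numeric columns only
pp <- preProcess(DF[, num_cols], method = c("center", "scale"))
# Build the dummy encoder for the factor predictor
dv <- dummyVars(~ colour, data = DF)

# Standardized numerics, dummy columns, and the untouched outcome
train_df <- cbind(
  predict(pp, DF[, num_cols]),
  predict(dv, DF),
  Class = DF$Class
)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_model <- train(Class ~ ., data = train_df, method = "knn", trControl = ctrl)
```

To score new observations, the same pp and dv objects would be applied to them with predict() before calling predict(knn_model, ...), so that the training means and standard deviations are reused rather than recomputed.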