Preprocessing (center and scale) only specific variables (numeric variables)


I have a data frame that consists of numerical and non-numerical variables. I am trying to fit a logistic regression model that predicts my variable "risk" from all other variables, optimizing AUC with 6-fold cross-validation. However, I want to center and scale only the numerical explanatory variables. My code raises no errors or warnings, but I cannot figure out how to tell train(), through preProcess or in some other way, to center and scale just the numerical variables.

Here is the code:

test <- train(risk ~ .,
              method = "glm",
              data = df,
              family = binomial(link = "logit"),
              preProcess = c("center", "scale"),
              trControl = trainControl(method = "cv",
                                       number = 6,
                                       classProbs = TRUE,
                                       summaryFunction = prSummary),
              metric = "AUC")

Solution

  • You could center and scale all numerical variables in the original df first, and then apply the train function to the scaled df:

    library(dplyr)
    library(caret)
    
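    # Center and scale every numeric column; non-numeric columns are left as-is.
    # scale() returns a one-column matrix, so as.numeric() keeps plain vectors.
    df <- df %>%
            dplyr::mutate(dplyr::across(where(is.numeric), ~ as.numeric(scale(.x))))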
    
    test <- train(risk ~ .,
                  method = "glm",
                  data = df,
                  family = binomial(link = "logit"),
                  trControl = trainControl(method = "cv",
                                           number = 6,
                                           classProbs = TRUE,
                                           summaryFunction = prSummary),
                  metric = "AUC")
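
  • Scaling df up front uses the means and standard deviations of the full data set, so each held-out fold contributes to its own preprocessing. One possible alternative, sketched below, is to pass a recipe to train() so that centering and scaling are re-estimated on the training part of every fold and applied only to numeric predictors. This is a minimal sketch, not a definitive solution: it assumes the recipes package is installed (and MLmetrics, which prSummary relies on), and the toy data frame with columns age, income and region is hypothetical, standing in for the real df.

```r
library(dplyr)
library(caret)
library(recipes)

set.seed(1)

# Hypothetical toy data standing in for the real `df`
n <- 120
df <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),
  income = rnorm(n, mean = 40000, sd = 8000),
  region = factor(sample(c("north", "south"), n, replace = TRUE))
)
# classProbs = TRUE needs a factor outcome whose levels are valid R names
df$risk <- factor(ifelse(df$age + rnorm(n, sd = 10) > 50, "high", "low"))

# Center and scale numeric predictors only; factor predictors are untouched
rec <- recipe(risk ~ ., data = df) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# train() accepts a recipe in place of a formula; the recipe is
# re-estimated inside each resampling iteration
test <- train(rec,
              data = df,
              method = "glm",
              family = binomial(link = "logit"),
              trControl = trainControl(method = "cv",
                                       number = 6,
                                       classProbs = TRUE,
                                       summaryFunction = prSummary),
              metric = "AUC")

# Inspect what the recipe does to the full data set
baked <- bake(prep(rec), new_data = NULL)
summary(baked)
```

    The summary of baked should show age and income with mean close to 0, while region and risk remain factors.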