Tags: r, r-caret, cross-validation, feature-extraction

R: Feature Selection with Cross Validation using Caret on Logistic Regression


I am currently learning how to implement logistic regression in R.

I have taken a data set and split it into a training and test set, and I wish to implement forward selection, backward selection, and best subset selection using cross-validation to select the best features. I am using caret to run cross-validation on the training set and then testing the predictions on the test data.

I have seen the rfe control in caret and have also had a look at the documentation on the caret website, as well as following the links in the question How to use wrapper feature selection with algorithms in R?. It isn't apparent to me how to change the type of feature selection, as it seems to default to backward selection. Can anyone help me with my workflow? Below is a reproducible example:

library("caret")

# Create an example dataset from the German Credit data bundled with caret
data(GermanCredit)
mydf <- GermanCredit

# Create train and test sets with an 80/20 split
set.seed(1)  # make the split reproducible
trainIndex <- createDataPartition(mydf$Class, p = .8,
                                  list = FALSE,
                                  times = 1)

train <- mydf[ trainIndex,]
test  <- mydf[-trainIndex,]


ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     savePredictions = TRUE)

mod_fit <- train(Class ~ ., data = train,
                 method = "glm",
                 family = "binomial",
                 trControl = ctrl,
                 tuneLength = 5)  # tuneLength has no effect for glm (no tuning parameters)


# Check out Variable Importance
varImp(mod_fit)
summary(mod_fit)

# Test the fitted model on unseen data
pred <- predict(mod_fit, newdata = test)
accuracy <- table(pred, test$Class)
sum(diag(accuracy)) / sum(accuracy)

Solution

  • You can specify the stepwise selection directly in the train call (your mod_fit). For backward stepwise selection, the code below is sufficient:

    trControl <- trainControl(method = "cv",
                              number = 5,
                              savePredictions = TRUE,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

    caret_model <- train(Class ~ .,
                         data = train,
                         method = "glmStepAIC",   # fits the model stepwise via MASS::stepAIC
                         family = "binomial",
                         direction = "backward",  # direction of the stepwise search
                         metric = "ROC",          # matches twoClassSummary
                         trControl = trControl)
    

    Note that in trControl:

    method = "cv",                     # no need for "repeatedcv" here; number sets the k of the k-fold CV
    classProbs = TRUE,
    summaryFunction = twoClassSummary  # returns ROC, sensitivity and specificity for the chosen model
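
  • A minimal follow-up sketch, assuming the same train/test split as in the question: the stepwise model returned by train can be scored on the held-out test set just like mod_fit.

    pred <- predict(caret_model, newdata = test)
    accuracy <- table(pred, test$Class)
    sum(diag(accuracy)) / sum(accuracy)
    # confusionMatrix(pred, test$Class) gives a fuller summary

    direction = "both" also works out of the box. A true forward search is awkward here, because glmStepAIC starts MASS::stepAIC from the full model; without a null-model scope there is nothing to add, so the full model is returned unchanged.

  • If you specifically want the rfe wrapper mentioned in the question, a rough sketch with caret's built-in lrFuncs helper could look like the following (a binomial glm is refit at each candidate subset size; the sizes below are arbitrary, illustrative choices).

    # Recursive feature elimination = backward search over the subset sizes in `sizes`
    x <- train[, setdiff(names(train), "Class")]
    nzv <- nearZeroVar(x)               # GermanCredit has a few constant dummy columns
    if (length(nzv) > 0) x <- x[, -nzv]
    y <- train$Class

    rfe_ctrl <- rfeControl(functions = lrFuncs,   # logistic-regression helper functions
                           method = "cv",
                           number = 5)

    set.seed(1)
    rfe_fit <- rfe(x, y,
                   sizes = c(5, 10, 20, 40),      # candidate subset sizes compared by CV
                   rfeControl = rfe_ctrl)

    predictors(rfe_fit)                           # variables kept in the selected subset

    Keep in mind that rfe is inherently a backward-elimination wrapper; as far as I know, caret has no built-in forward or exhaustive best-subset wrapper for glm, so those would need a manual loop or another package.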