I have a classification problem, and one of the predictors is a categorical variable X with four levels A, B, C, D that was transformed into three dummy variables A, B, C. I was trying to use Recursive Feature Elimination (RFE) from the caret package to conduct feature selection. How do I tell the rfe function to treat the dummy variables for X as a group, so that if, say, A is excluded, B and C are excluded too?
After fighting with this all day, I'm still getting nowhere. Feeding rfe via the formula interface doesn't work either. I think rfe automatically converts any factors to dummy variables.
Below is my example code:
# rfe settings
lrFuncs$summary <- twoClassSummary
trainctrl <- trainControl(classProbs = TRUE,
                          summaryFunction = twoClassSummary)
ctrl <- rfeControl(functions = lrFuncs, method = "cv", number = 3)
#Data pre-process to exclude nzv and highly correlated variables
x<-training[,c(1, 4:25, 27:39)]
x2<-model.matrix(~., data = x)[,-1]
nzv <- nearZeroVar(x2,freqCut = 300/1)
x3 <- x2[, -nzv]
corr_mat <- cor(x3)
too_high <- findCorrelation(corr_mat, cutoff = .9)
x4 <- x3[, -too_high]
excludes <- c(colnames(x2)[nzv], colnames(x3)[too_high])  # nzv indexes x2; too_high indexes x3
#Exclude the variables identified
x_frame<-x[ , -which(names(x) %in% c(excludes))]
#Run rfe
set.seed(408)
#This does not work with the error below
glmProfile<-rfe(x_frame,y,sizes =subsets, rfeControl = ctrl,trControl =trainctrl,metric = "ROC")
Error in { : task 1 failed - "undefined columns selected"
In addition: Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
# It works if I convert x_frame to a matrix and then back to a data frame, but this way rfe may remove only some of the dummy variables (e.g. remove A but leave B and C)
glmProfile<-rfe(data.frame(model.matrix(~., data = x_frame)[,-1]),y,sizes =subsets, rfeControl = ctrl,trControl =trainctrl,metric = "ROC")
Here x_frame contains categorical variables that have multiple levels.
Any help is highly appreciated!
First: yes, you are right that you cannot use categorical features with RFE directly (there's a reasonable explanation of this by Max here on CV). Interestingly, encoding all levels as dummy variables does cause an error, which can be resolved by removing one dummy variable. Consequently, I too would preprocess your data by creating dummy variables from the categorical variable, leaving out one level.
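To make the "leave out one level" step concrete, here is a minimal sketch in base R. The toy data frame and the column names x_cat and x_num are made up for illustration; it assumes the default treatment contrasts, under which model.matrix() drops the first factor level.

```r
# Toy data; x_cat and x_num are hypothetical names used only for this example.
df <- data.frame(
  x_cat = factor(c("A", "B", "C", "D", "A", "B", "C", "D")),
  x_num = c(1.2, 3.4, 2.2, 5.1, 0.7, 4.4, 2.9, 3.3)
)

# With the default treatment contrasts, the first level (A) becomes the
# reference level and gets no dummy column of its own.
mm <- model.matrix(~ ., data = df)

# Drop the intercept column so only the predictors remain.
x_dummies <- as.data.frame(mm[, -1])
colnames(x_dummies)
```

The resulting data frame has one 0/1 column per non-reference level plus the numeric column, and could be passed to rfe() as the x argument.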
But I would not try to force RFE to keep either all or none of the dummy variables. If RFE discards some of them but not all, then some levels simply carry more information about the target than others, which is a reasonable outcome. Imagine that, of the levels A, B, C, only level A holds valuable information for your target variable. If A was kept as a dummy variable, RFE would likely discard B and C. If A was the level left out during dummy variable creation, RFE would likely keep both B and C.
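Which level gets left out is determined by the factor's reference level, so the two scenarios above correspond to two different codings of the same variable. A small sketch in base R, using a made-up three-level factor and assuming the default treatment contrasts:

```r
# Hypothetical three-level factor, as in the A, B, C example above.
x <- factor(c("A", "B", "C", "A", "B", "C"))

# Default coding: A is the reference level, so dummies exist for B and C only.
cols_ref_a <- colnames(model.matrix(~ x))

# Make C the reference level instead, so dummies exist for A and B.
x_ref_c <- relevel(x, ref = "C")
cols_ref_c <- colnames(model.matrix(~ x_ref_c))

cols_ref_a
cols_ref_c
```

With A as the reference, the information "this row is A" is only available as "neither B nor C", which is why RFE would tend to keep both of those dummies in that case.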
PS: when mixing continuous and categorical predictors, consider scaling your data before handing it to RFE, so that both kinds of predictors have a roughly comparable influence on the ranking.
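One possible reading of that scaling advice, sketched in base R: standardize the continuous columns and leave the 0/1 dummies as they are. The data, the column names, and the "only 0/1 values means dummy" heuristic are all assumptions made for this example, not something caret requires.

```r
# Toy predictors: one dummy column and one continuous column (hypothetical names).
x <- data.frame(
  x_catB = c(0, 1, 0, 1, 0, 0),
  x_num  = c(120, 340, 220, 510, 70, 440)
)

# Heuristic: treat a column as a dummy if it contains only 0s and 1s.
is_dummy <- vapply(x, function(col) all(col %in% c(0, 1)), logical(1))

# Center and scale the continuous columns only; dummies stay untouched.
x_scaled <- x
x_scaled[!is_dummy] <- lapply(x[!is_dummy], function(col) as.numeric(scale(col)))
```

Whether the dummies should also be rescaled depends on the model, so this is just a starting point.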