I am building a model with the mlr package using supervised methods. The steps I followed are:
1) Cleaned the data
2) Applied feature selection (Correlation-based Feature Selection, CFS)
3) Made predictions using the mlr package
library(mlr)

# Standardize the CFS-selected features and re-attach the label
mlr_data <- as.data.frame(scale(df_allF[, res.cfs]))
mlr_data$label <- factor(df_allF$label)

# Drop columns that became all-NaN after scaling (zero-variance features)
nan_cols <- sapply(mlr_data, function(x) all(is.nan(x)))
mlr_data <- mlr_data[, !nan_cols]

task <- makeClassifTask(data = mlr_data, target = "label")
task <- normalizeFeatures(task, method = "standardize")

lrn <- makeLearner("classif.rpart", predict.type = "prob")
rdesc <- makeResampleDesc("LOO")
rin <- makeResampleInstance(rdesc, task)

# Random search over rpart hyperparameters, tuning the decision threshold as well
ps <- makeParamSet(
  makeIntegerParam("minsplit", lower = 10, upper = 50),
  makeIntegerParam("minbucket", lower = 5, upper = 50),
  makeNumericParam("cp", lower = 0.001, upper = 0.2)
)
ctrl1 <- makeTuneControlRandom(tune.threshold = TRUE)
tune.res <- tuneParams(lrn, task = task, resampling = rdesc, measures = acc,
                       par.set = ps, control = ctrl1)

# Train a tree with the tuned hyperparameters and predict on the task
rpart.tree <- setHyperPars(lrn, par.vals = tune.res$x)
t.rpart <- train(rpart.tree, task)
getLearnerModel(t.rpart)
tpmodel <- predict(t.rpart, task)

cat("\nConfusion Matrix before setting Threshold:\n")
calculateConfusionMatrix(tpmodel)

# Apply the tuned decision threshold
tpmodel <- setThreshold(tpmodel, tune.res$threshold)
cat("\nConfusion Matrix after setting Threshold:\n")
calculateConfusionMatrix(tpmodel)

cat("\nMeasures: ")
m1 <- performance(tpmodel, acc)
m2 <- measureF1(tpmodel$data$truth, tpmodel$data$response, positive = "Healthy")
cat("F1 =", m2, "Accuracy =", m1)
The F1 and accuracy results are:

Dataset with all features: F1 = 0.923, Accuracy = 0.928
Dataset with features selected by CFS: F1 = 0.863, Accuracy = 0.857
Dataset with features selected by information gain: F1 = 0.8947, Accuracy = 0.904
Here, feature selection does not improve the results. The whole dataset consists of 154 features and 42 rows. Is there a reason for this, or a way to fix it? I have tried most feature selection methods, but none of them improved the results.
The feature selection methods you're using don't use the performance of a classifier to select features. For that, use a wrapper method that explicitly takes performance into account. That said, there's no guarantee that feature selection will improve performance.
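A wrapper approach in mlr can be sketched as below. This is a minimal example, not your pipeline: it uses the built-in iris data as a stand-in task (substitute your own task), the same `classif.rpart` learner, and mlr's `selectFeatures` with a sequential forward search control.

```r
library(mlr)

# Stand-in task; replace with your own makeClassifTask(...) on mlr_data
task <- makeClassifTask(data = iris, target = "Species")

lrn   <- makeLearner("classif.rpart", predict.type = "prob")
rdesc <- makeResampleDesc("CV", iters = 5)

# Wrapper selection: sequential forward search ("sfs") that adds features
# one at a time, scored by the resampled accuracy of the actual learner
ctrl <- makeFeatSelControlSequential(method = "sfs")
sf <- selectFeatures(learner = lrn, task = task, resampling = rdesc,
                     measures = acc, control = ctrl)

sf$x  # the selected feature subset
sf$y  # resampled accuracy achieved with that subset

# Retrain on the reduced task
task.sel <- subsetTask(task, features = sf$x)
model    <- train(lrn, task.sel)
```

Because the search is driven by the learner's own resampled performance, the selected subset is tailored to that learner, but note it can still overfit the selection step; with only 42 rows, nested resampling (e.g. via `makeFeatSelWrapper` inside `resample`) gives a more honest estimate.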