The bagging wrapper seems to give strange results. When I apply it to a simple logistic regression, the logloss is inflated by roughly a factor of ten:
library(mlbench)
library(mlr)

data(PimaIndiansDiabetes)
trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes, target = "diabetes", positive = "pos")

# bagged logistic regression: 10 bootstrap models on 80% subsamples, all features
bagged.lrn = makeBaggingWrapper(makeLearner("classif.logreg"), bw.iters = 10, bw.replace = TRUE, bw.size = 0.8, bw.feats = 1)
bagged.lrn = setPredictType(bagged.lrn, "prob")

# plain logistic regression for comparison
non.bagged.lrn = setPredictType(makeLearner("classif.logreg"), "prob")

rdesc = makeResampleDesc("CV", iters = 5L)
resample(learner = non.bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE, measures = logloss)
resample(learner = bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE, measures = logloss)
gives
Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg
logloss.aggr: 0.49
logloss.mean: 0.49
logloss.sd: 0.02
Runtime: 0.0699999
for the first learner and
Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg.bagged
logloss.aggr: 5.41
logloss.mean: 5.41
logloss.sd: 0.80
Runtime: 0.645
for the bagged one. The bagged learner thus performs much worse. Is this a bug, or did I do something wrong?
Here is my sessionInfo():
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlr_2.9 stringi_1.1.1 ParamHelpers_1.8 ggplot2_2.1.0 BBmisc_1.10 mlbench_2.1-1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.6 magrittr_1.5 splines_3.3.1 munsell_0.4.3 lattice_0.20-33 xtable_1.8-2 colorspace_1.2-6
[8] R6_2.1.2 plyr_1.8.4 dplyr_0.5.0 tools_3.3.1 parallel_3.3.1 grid_3.3.1 checkmate_1.8.1
[15] data.table_1.9.6 gtable_0.2.0 DBI_0.4-1 htmltools_0.3.5 ggvis_0.4.3 survival_2.39-4 assertthat_0.1
[22] digest_0.6.9 tibble_1.1 Matrix_1.2-6 shiny_0.13.2 mime_0.5 parallelMap_1.3 scales_0.4.0
[29] backports_1.0.3 httpuv_1.3.3 chron_2.3-47
There's not necessarily anything wrong with this result, though the bagging model could be better specified.
Bagging doesn't always improve a given performance statistic; its main purpose is to reduce variance and guard against overfitting, which often, but not always, improves accuracy as well.
So the reason your non-bagged model shows the better logloss here may simply be that it is overfitting, or otherwise producing a biased result whose performance statistics are misleading.
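For intuition, here is a minimal sketch of the core idea behind the wrapper, assuming it averages the predicted probabilities of logistic regressions fit to bootstrap subsamples (the manual train/test split and variable names here are illustrative, not mlr internals):

library(mlbench)
data(PimaIndiansDiabetes)
set.seed(1)

# hold out some rows so the averaged probabilities can be evaluated
test_idx <- sample(nrow(PimaIndiansDiabetes), 150)
train <- PimaIndiansDiabetes[-test_idx, ]
test  <- PimaIndiansDiabetes[test_idx, ]

# fit 10 logistic regressions, each on a bootstrap sample of 80% of the rows,
# and average their predicted probabilities of the positive class
probs <- replicate(10, {
  idx <- sample(nrow(train), size = round(0.8 * nrow(train)), replace = TRUE)
  fit <- glm(diabetes ~ ., data = train[idx, ], family = binomial)
  predict(fit, newdata = test, type = "response")
})
p_bagged <- rowMeans(probs)

Each individual glm can be quite confident (probabilities near 0 or 1), and logloss is very sensitive to confident mistakes, which is one way a small, loosely tuned ensemble can end up with a worse logloss than the single model.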
However, here's an improved specification of the bagging model that cuts the average logloss by about 70% (from 5.41 to 1.65):
pacman::p_load(mlbench, mlr)

data(PimaIndiansDiabetes)
set.seed(1)
trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes, target = "diabetes", positive = "pos")

bagged.lrn = makeBaggingWrapper(makeLearner("classif.logreg"),
                                bw.iters = 100,   # average 100 models instead of 10
                                bw.replace = TRUE,
                                bw.size = 0.6,    # smaller bootstrap samples
                                bw.feats = 0.5)   # random half of the features per model
bagged.lrn = setPredictType(bagged.lrn, "prob")
non.bagged.lrn = setPredictType(makeLearner("classif.logreg"), "prob")

rdesc = makeResampleDesc("CV", iters = 10L)
resample(learner = non.bagged.lrn,
         task = trainTask1,
         resampling = rdesc,
         show.info = TRUE,
         measures = logloss)
resample(learner = bagged.lrn,
         task = trainTask1,
         resampling = rdesc,
         show.info = TRUE,
         measures = logloss)
where the key result is
Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg.bagged
logloss.aggr: 1.65
logloss.mean: 1.65
logloss.sd: 0.90
Runtime: 14.0544
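Two things plausibly drive the improvement: averaging 100 models fit on smaller, feature-subsampled bags tends to pull the ensemble's probabilities away from the extremes, and logloss punishes extreme wrong predictions severely. A quick check of the measure itself (hypothetical probabilities, not output from the models above):

# per-case logloss for a true positive is -log(p_hat)
-log(0.50)   # ~0.69: an uninformative prediction costs little
-log(0.01)   # ~4.61: one confidently wrong prediction can dominate the mean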