
Possible bug with bagging wrapper in mlr


The bagging wrapper seems to give strange results. If I apply it to a simple logistic regression, the logloss is amplified by a factor of roughly 10:

library(mlbench)
library(mlr)

data(PimaIndiansDiabetes)

trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes, target = "diabetes", positive = "pos")

# Bagged logistic regression: 10 bootstrap models, each trained on 80% of the rows, using all features
bagged.lrn = makeBaggingWrapper(makeLearner("classif.logreg"), bw.iters = 10, bw.replace = TRUE, bw.size = 0.8, bw.feats = 1)
bagged.lrn = setPredictType(bagged.lrn, "prob")
non.bagged.lrn = setPredictType(makeLearner("classif.logreg"), "prob")

rdesc = makeResampleDesc("CV", iters = 5L)

resample(learner = non.bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE, measures = logloss)
resample(learner = bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE, measures = logloss)

gives

Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg
logloss.aggr: 0.49
logloss.mean: 0.49
logloss.sd: 0.02
Runtime: 0.0699999

for the first learner and

Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg.bagged
logloss.aggr: 5.41
logloss.mean: 5.41
logloss.sd: 0.80
Runtime: 0.645

for the bagged one. The bagged learner thus performs much worse. Is this a bug, or did I do something wrong?

Here is my sessionInfo():

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mlr_2.9          stringi_1.1.1    ParamHelpers_1.8 ggplot2_2.1.0    BBmisc_1.10      mlbench_2.1-1   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6      magrittr_1.5     splines_3.3.1    munsell_0.4.3    lattice_0.20-33  xtable_1.8-2     colorspace_1.2-6
 [8] R6_2.1.2         plyr_1.8.4       dplyr_0.5.0      tools_3.3.1      parallel_3.3.1   grid_3.3.1       checkmate_1.8.1 
[15] data.table_1.9.6 gtable_0.2.0     DBI_0.4-1        htmltools_0.3.5  ggvis_0.4.3      survival_2.39-4  assertthat_0.1  
[22] digest_0.6.9     tibble_1.1       Matrix_1.2-6     shiny_0.13.2     mime_0.5         parallelMap_1.3  scales_0.4.0    
[29] backports_1.0.3  httpuv_1.3.3     chron_2.3-47    

Solution

  • There's not necessarily anything wrong with this result, though the bagging model could be better specified.

    Bagging doesn't always give you better performance statistics; its main benefit is that it reduces variance, which helps you avoid overfitting and can improve accuracy.

    So the reason your non-bagged model shows better performance statistics may simply be that it's overfitting, or otherwise producing a biased fit whose statistics are misleading. One likely contributor in your setup is the small number of bagging iterations: with only 10 ensemble members the aggregated class probabilities are very coarse, and logloss penalizes confident but wrong probability estimates extremely harshly, so a few near-zero probabilities for the true class can dominate the average. A quick multi-measure comparison (sketched below) can help tell these cases apart.
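
    If the bagged model's mmce and auc turn out comparable to the plain model's while only its logloss is far worse, the issue is probability calibration rather than classification accuracy. A minimal sketch of that comparison, reusing the task, learners, and resampling description defined above:

    multi.measures = list(logloss, auc, mmce)

    # Same CV comparison as before, with an accuracy measure (mmce) and a
    # ranking measure (auc) reported alongside logloss
    resample(learner = non.bagged.lrn, task = trainTask1, resampling = rdesc,
             show.info = FALSE, measures = multi.measures)
    resample(learner = bagged.lrn, task = trainTask1, resampling = rdesc,
             show.info = FALSE, measures = multi.measures)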

    However, here's a much-improved specification of the bagging model, using more bagging iterations (bw.iters = 100) and feature subsampling (bw.feats = .5), that brings the average logloss down by about 70%:

    pacman::p_load(mlbench, mlr)  # installs (if needed) and loads both packages
    
    data(PimaIndiansDiabetes)
    set.seed(1)
    
    trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes,target = "diabetes",positive = "pos")
    
    bagged.lrn     = makeBaggingWrapper(makeLearner("classif.logreg"), 
                                        bw.iters = 100, 
                                        bw.replace = TRUE, 
                                        bw.size = .6, 
                                        bw.feats = .5)
    bagged.lrn     = setPredictType(bagged.lrn,"prob")
    non.bagged.lrn = setPredictType(makeLearner("classif.logreg"),"prob")
    
    rdesc = makeResampleDesc("CV", iters = 10L)
    
    resample(learner    = non.bagged.lrn, 
             task       = trainTask1, 
             resampling = rdesc, 
             show.info  = TRUE,
             measures   = logloss)
    
    
    resample(learner    = bagged.lrn, 
             task       = trainTask1, 
             resampling = rdesc, 
             show.info  = TRUE,
             measures   = logloss)
    

    where the key result is

    Resample Result
    Task: PimaIndiansDiabetes
    Learner: classif.logreg.bagged
    logloss.aggr: 1.65
    logloss.mean: 1.65
    logloss.sd: 0.90
    Runtime: 14.0544
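
    If you want to see directly how coarse or fine the ensemble's probability estimates are, you can train the bagged learner once and inspect its predictions. This is a diagnostic sketch using the objects defined above (getPredictionProbabilities() is mlr's accessor for predicted class probabilities), predicting in-sample purely for illustration:

    # Fit the bagged ensemble on the full task and look at its probabilities;
    # with few bw.iters the aggregated estimates are coarse-grained, which
    # logloss punishes hard whenever they land near 0 for the true class
    mod  = train(bagged.lrn, trainTask1)
    pred = predict(mod, task = trainTask1)
    head(getPredictionProbabilities(pred))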