
Why is the error rate from bagging trees much higher than that from a single tree?


I cross-posted this question on Cross Validated (stats.stackexchange.com), but it seemed unlikely to receive an answer there, so I am posting it here as well.


I'm running the bagging trees (bootstrap aggregation) classification method and comparing its misclassification error rate with that of a single tree. We expect bagging to do better than a single tree, i.e. the error rate from bagging should be lower than that of the single tree.

I repeat the whole procedure M = 100 times in a for loop, each time randomly splitting the original data set into a training set and a test set, to obtain 100 single-tree test errors and 100 bagging test errors. Then I use boxplots to compare the distributions of these two types of errors.

# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)

# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)

# Define statistic
estim.pred <- function(a.sample, vector.of.indices)
      {
      current.train <- a.sample[vector.of.indices, ]
      current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
      predict(current.fitted.model, test.set, type = "class")
      }

for (j in 1:M)
      {
      # Split the data into test/train sets
      train.idx <- sample(1:n, ntrain, replace = FALSE)
      train.set <- PimaIndiansDiabetes[train.idx, ]
      test.set <- PimaIndiansDiabetes[-train.idx, ]

      # Train a direct tree model
      fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
      pred.test <- predict(fitted.tree, test.set, type = "class")
      single.tree.error[j] <- mean(pred.test != test.set$diabetes)


      # Bootstrap estimates
      res.boot = boot(train.set, estim.pred, B)
      pred.boot <- vector(length = ntest)
      for (i in 1:ntest)
            {
            pred.boot[i] <- ifelse (mean(res.boot$t[, i] == "pos")  >= 0.5, "pos", "neg")
            }
      bagging.error[j] <- mean(pred.boot != test.set$diabetes)
      }

boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging"))

The result is:

[Figure: boxplots of the misclassification errors for single.tree and bagging]

Could you please explain why the error rate for bagging trees is much higher than that of a single tree? This does not make sense to me. I've checked my code but could not find anything unusual.


Solution

  • I received an answer at https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre and am posting the link here to close this question and for future visitors.
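
For readers who cannot follow the link, here is a minimal base-R sketch of one likely cause. It is an assumption based on how R handles factors, not a summary of the linked answer. predict(..., type = "class") returns a factor, and when a factor is written into a pre-allocated matrix only its integer codes are kept. If res.boot$t holds such codes, comparing them with "pos" never matches, so the vote would always be "neg". The names stored, broken.vote, pos.code and fixed.vote are made up for this illustration.

```r
# Class predictions as a factor, the type returned by predict(..., type = "class").
pred <- factor(c("neg", "pos", "pos", "neg"), levels = c("neg", "pos"))

# Assumption: the bootstrap replicates are stored in a pre-allocated matrix,
# which keeps only the integer codes of a factor (neg -> 1, pos -> 2).
stored <- matrix(NA, nrow = 3, ncol = length(pred))
for (b in 1:3) stored[b, ] <- pred

# Comparing the codes with the label never matches, so every vote share is 0
# and the majority vote is "neg" for every test observation.
broken.share <- colMeans(stored == "pos")
broken.vote <- ifelse(broken.share >= 0.5, "pos", "neg")

# One possible fix: compare with the integer code of "pos" instead.
pos.code <- which(levels(pred) == "pos")
fixed.share <- colMeans(stored == pos.code)
fixed.vote <- ifelse(fixed.share >= 0.5, "pos", "neg")

print(broken.vote)
print(fixed.vote)
```

If this is indeed what happens in the code above, the bagging error would simply equal the share of "pos" cases in each test set. Under the same assumption, replacing == "pos" with == 2 in the voting loop would be one way to repair it, since the levels of diabetes are ordered "neg", "pos". Another would be to have estim.pred return a 0/1 indicator of "pos" rather than a factor.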