Search code examples
rgradientgbmboosting

R: how to improve gradient boosting model fit


I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris data set using gbm in the gbm package. I set the number of iterations to M = 1000 with a learning rate of learning.rate = 0.001. I then compared the results to those of a regression tree (using rpart). However, it seems that the regression tree is outperforming the gradient boosted model. What's the reason behind this? And how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should suffice with 1000 iterations/boosted trees.

library(rpart)
library(gbm)
data(iris)

train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]

learning.rate <- 0.001
M <- 1000
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian", n.trees = M, 
    interaction.depth = 2, shrinkage = learning.rate, bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)

tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)

> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438

Solution

  • In the iris dataset, there are 3 different species, first 50 rows are setosa, next 50 are versicolor and last 50 are virginica. So I think it's better if you mix the rows, and also make the Species column relevant.

    library(ggplot2)
    ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()
    

    enter image description here

    Secondly, you should do this over different a few replicates to see its uncertainty. For this we can use caret, and we can define the training samples before hand and also provide a fixed grid. What we are interested in, is the error during the training with cross-validation, which is similar to what you are doing:

    set.seed(999)
    idx = split(sample(nrow(iris)),1:nrow(iris) %% 3)
    tr = trainControl(method="cv",index=idx)
    this_grid = data.frame(interaction.depth=2,shrinkage=0.001,
    n.minobsinnode=10,n.trees=1000)
    

    gbm_fit = train(Sepal.Width ~ . ,data=iris,method="gbm", distribution="gaussian",tuneGrid=tg,trControl=tr)

    Then we use the same samples to fit rpart:

    #the default for rpart
    this_grid = data.frame(cp=0.01)
    rpart_fit = train(Sepal.Width ~ . ,data=iris,method="rpart",
    trControl=tr,tuneGrid=this_grid)
    

    Finally we compare them, and they are very similar:

    gbm_fit$resample
           RMSE  Rsquared       MAE Resample
    1 0.3459311 0.5000575 0.2585884        0
    2 0.3421506 0.4536114 0.2631338        1
    3 0.3428588 0.5600722 0.2693837        2
    
           RMSE  Rsquared       MAE Resample
    1 0.3492542 0.3791232 0.2695451        0
    2 0.3320841 0.4276960 0.2550386        1
    3 0.3284239 0.4343378 0.2570833        2
    

    So I suspect there's something weird in the example above. Again it always depend on your data, for some data like for example iris, rpart might be good enough because there are very strong predictors. Also for complex models like gbm, you most likely need to train using something like the above to find the optimal parameters.