I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris
data set using gbm
in the gbm
package. I set the number of iterations to M = 1000
with a learning rate of learning.rate = 0.001
. I then compared the results to those of a regression tree (using rpart
). However, it seems that the regression tree is outperforming the gradient boosted model. What's the reason behind this? And how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should suffice with 1000 iterations/boosted trees.
library(rpart)
library(gbm)
data(iris)
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M <- 1000
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian", n.trees = M,
interaction.depth = 2, shrinkage = learning.rate, bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)
tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)
> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438
In the iris dataset, there are 3 different species, first 50 rows are setosa, next 50 are versicolor and last 50 are virginica. So I think it's better if you mix the rows, and also make the Species column relevant.
library(ggplot2)
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()
Secondly, you should do this over different a few replicates to see its uncertainty. For this we can use caret, and we can define the training samples before hand and also provide a fixed grid. What we are interested in, is the error during the training with cross-validation, which is similar to what you are doing:
set.seed(999)
idx = split(sample(nrow(iris)),1:nrow(iris) %% 3)
tr = trainControl(method="cv",index=idx)
this_grid = data.frame(interaction.depth=2,shrinkage=0.001,
n.minobsinnode=10,n.trees=1000)
gbm_fit = train(Sepal.Width ~ . ,data=iris,method="gbm", distribution="gaussian",tuneGrid=tg,trControl=tr)
Then we use the same samples to fit rpart:
#the default for rpart
this_grid = data.frame(cp=0.01)
rpart_fit = train(Sepal.Width ~ . ,data=iris,method="rpart",
trControl=tr,tuneGrid=this_grid)
Finally we compare them, and they are very similar:
gbm_fit$resample
RMSE Rsquared MAE Resample
1 0.3459311 0.5000575 0.2585884 0
2 0.3421506 0.4536114 0.2631338 1
3 0.3428588 0.5600722 0.2693837 2
RMSE Rsquared MAE Resample
1 0.3492542 0.3791232 0.2695451 0
2 0.3320841 0.4276960 0.2550386 1
3 0.3284239 0.4343378 0.2570833 2
So I suspect there's something weird in the example above. Again it always depend on your data, for some data like for example iris, rpart might be good enough because there are very strong predictors. Also for complex models like gbm, you most likely need to train using something like the above to find the optimal parameters.