Tags: r, machine-learning, parallel-processing, gbm

Is there a parallel implementation of GBM in R?


I use the gbm library in R and I would like to use all my CPU cores to fit a model.

gbm.fit(x, y,
        offset = NULL,
        misc = NULL,...

Solution

  • As far as I know, both h2o and xgboost have this.

    For h2o, see e.g. this blog post of theirs from 2013, from which I quote:

    At 0xdata we build state-of-the-art distributed algorithms - and recently we embarked on building GBM, an algorithm notorious for being impossible to parallelize, much less distribute. We built the algorithm shown in Elements of Statistical Learning II, Trevor Hastie, Robert Tibshirani, and Jerome Friedman on page 387 (shown at the bottom of this post). Most of the algorithm is straightforward “small” math, but step 2.b.ii says “Fit a regression tree to the targets….”, i.e. fit a regression tree in the middle of the inner loop, for targets that change with each outer loop. This is where we decided to distribute/parallelize.

    The platform we build on is H2O, and as talked about in the prior blog has an API focused on doing big parallel vector operations - and for GBM (and also Random Forest) we need to do big parallel tree operations. But not really any tree operation; GBM (and RF) constantly build trees - and the work is always at the leaves of a tree, and is about finding the next best split point for the subset of training data that falls into a particular leaf.

    The code can be found on our git: http://0xdata.github.io/h2o/

    (Edit: The repo now is at https://github.com/h2oai/.)
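    To give a feel for it, here is a minimal sketch of fitting a parallel GBM with the h2o R package. The calls (`h2o.init`, `as.h2o`, `h2o.gbm`) are from h2o's R API as I understand it; the use of `iris` and its column names is just an illustration, so adapt `x`, `y`, and `ntrees` to your data.

    ```r
    # Minimal h2o GBM sketch; h2o parallelizes tree building across cores.
    library(h2o)

    h2o.init(nthreads = -1)   # -1 = let the H2O cluster use all available cores

    train <- as.h2o(iris)     # push an R data.frame into the H2O cluster

    fit <- h2o.gbm(x = c("Sepal.Length", "Sepal.Width",
                         "Petal.Length", "Petal.Width"),
                   y = "Species",
                   training_frame = train,
                   ntrees = 100)

    h2o.performance(fit)      # training metrics for the fitted model
    h2o.shutdown(prompt = FALSE)
    ```

    Note that h2o runs the model in a local Java-based cluster, so the data has to be converted with `as.h2o` first rather than passed as a plain matrix.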

    The other parallel GBM implementation is, I think, in xgboost. Its DESCRIPTION says:

    Extreme Gradient Boosting, which is an efficient implementation of gradient boosting framework. This package is its R interface. The package includes efficient linear model solver and tree learning algorithms. The package can automatically do parallel computation on a single machine which could be more than 10 times faster than existing gradient boosting packages. It supports various objective functions, including regression, classification and ranking. The package is made to be extensible, so that users are also allowed to define their own objectives easily.
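    A minimal sketch with xgboost's R interface might look like the following; the `nthread` parameter is what controls single-machine parallelism. The bundled `agaricus.train` dataset and the chosen objective are just illustrative assumptions, so substitute your own data and objective.

    ```r
    # Minimal xgboost sketch; nthread controls how many CPU cores are used.
    library(xgboost)

    data(agaricus.train, package = "xgboost")  # example dataset shipped with xgboost
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

    fit <- xgb.train(params = list(objective = "binary:logistic",
                                   nthread = parallel::detectCores()),
                     data = dtrain,
                     nrounds = 50)
    ```

    Unlike gbm, xgboost takes a sparse matrix (or `xgb.DMatrix`) rather than a data.frame, so factors need to be one-hot encoded first.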