
What's the difference between lgb.train() and lightgbm() in R?


I'm trying to build a regression model in R using LightGBM, and I'm getting a bit confused about some functions and when/how to use them.

The first one is what I've written in the title: what's the difference between lgb.train() and lightgbm()?

The description in the documentation (https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is a 'Simple interface for training a LightGBM model', yet both return an lgb.Booster, i.e. a trained model. One difference I've found is that lgb.train() does not work with valids =, while lightgbm() does.

The second one is about the function lgb.cv(), which does cross-validation in LightGBM. How do you apply the output of lgb.cv() to a model? As I understood from the documentation I've linked above, the output of both lgb.cv() and lgb.train() seems to be a model. Is it correct to use it like the example below?

lgbcv <- lgb.cv(params,
                lgbtrain,
                nrounds = 1000,
                nfold = 5,
                early_stopping_rounds = 100,
                learning_rate = 1.0)

lgbcv <- lightgbm(params,
                  lgbtrain,
                  nrounds = 1000,
                  early_stopping_rounds = 100,
                  learning_rate = 1.0)

Thank you in advance!


Solution

  • what's the difference between lgb.train() and lightgbm()?

    These functions both train a LightGBM model, they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.

    Choose whichever method you feel has a more friendly interface...both will produce a single trained LightGBM model (class "lgb.Booster").
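
    For illustration, here is a minimal sketch of both interfaces side by side, using the agaricus data that ships with the package (the parameter values are just placeholders):

    library(lightgbm)
    data(agaricus.train, package = "lightgbm")
    train <- agaricus.train
    params <- list(objective = "regression", metric = "l2")

    # lgb.train(): you construct the Dataset yourself
    dtrain <- lgb.Dataset(train$data, label = train$label)
    bst1 <- lgb.train(params = params, data = dtrain, nrounds = 5L)

    # lightgbm(): pass the raw matrix and label directly;
    # the Dataset is created for you internally
    bst2 <- lightgbm(data = train$data, label = train$label,
                     params = params, nrounds = 5L)

    class(bst1)  # "lgb.Booster"
    class(bst2)  # "lgb.Booster"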

  • that lgb.train() does not work with valids =, while lightgbm() does.

    This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
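
    As a quick sketch (continuing from the snippet above), a validation set is passed as a named list of Dataset objects, where lgb.Dataset.create.valid() builds a validation Dataset tied to the training one:

    # supply a validation set via 'valids' (a named list of Datasets)
    data(agaricus.test, package = "lightgbm")
    test <- agaricus.test
    dvalid <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
    bst <- lgb.train(
      params = params
      , data = dtrain
      , nrounds = 5L
      , valids = list(test = dvalid)
    )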

  • How do you apply the output of lgb.cv() to a model?

    I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.

    library(lightgbm)
    data(agaricus.train, package = "lightgbm")
    train <- agaricus.train
    dtrain <- lgb.Dataset(train$data, label = train$label)
    params <- list(objective = "regression", metric = "l2")

    # run 3-fold cross-validation for 5 boosting rounds
    model <- lgb.cv(
      params = params
      , data = dtrain
      , nrounds = 5L
      , nfold = 3L
      , min_data = 1L
      , learning_rate = 1.0
    )
    

    This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).

    You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).
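
    As a short sketch continuing the lgb.cv() example above (field names taken from the "lgb.CVBooster" structure in current versions of the package):

    # cross-validated evaluation history and the best average score
    str(model$record_evals, max.level = 3L)
    model$best_score

    # per-fold trained models, if you really need them
    fold_booster <- model$boosters[[1L]]$booster
    class(fold_booster)  # "lgb.Booster"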