
In lightgbm why do the train and the cv APIs accept categorical_feature argument when it is already present in the dataset construction


The following is the .cv API of lightgbm:

lightgbm.cv(params, train_set, num_boost_round=100, folds=None, nfold=5, stratified=True, shuffle=True, metrics=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', fpreproc=None, seed=0, callbacks=None, eval_train_metric=False, return_cvbooster=False)

There is a parameter categorical_feature:

Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well).

Now the .train API:

lightgbm.train(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', keep_training_booster=False, callbacks=None)

Here too there is a categorical_feature parameter, and its documentation is identical to the above.

As you can see, both APIs consume a lightgbm Dataset, which itself takes a categorical_feature parameter with exactly the same documentation.

Questions:

  1. If both are specified which one takes precedence?
  2. Which one is the suggested place to specify the categorical_feature?
  3. Are the two choices in any way different internally to the working of the lightgbm pipeline?

Solution

  • These competing patterns have been in the library since September 2017 (this commit). The ability to specify categorical features in both interfaces was added mainly for convenience; it isn't functionally different from passing those arguments to lightgbm.Dataset().

    If both are specified which one takes precedence?

    Which one is the suggested place to specify the categorical_feature?

    Are the two choices in any way different internally to the working of the lightgbm pipeline?

    Always prefer passing it to lightgbm.Dataset, and ignore the argument to lightgbm.cv() / lightgbm.train().

    The categorical_feature argument passed into lightgbm.cv() / lightgbm.train() is used in only one place: a call to Dataset.set_categorical_feature() inside the lightgbm.cv() / lightgbm.train() function. At best, this call is a no-op that leaves the Dataset unchanged.
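
    When the raw data is still available (free_raw_data=False), that later call simply replaces the value given at Dataset construction, so the argument to lightgbm.train() / lightgbm.cv() takes precedence. A minimal sketch of that precedence, calling Dataset.set_categorical_feature() directly (the same call those functions make internally); the exact warning text may vary by lightgbm version:

    ```python
    import lightgbm as lgb
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=1_000, n_features=10)

    # Keep the raw data so the Dataset can be rebuilt after the override
    dtrain = lgb.Dataset(
        X,
        label=y,
        categorical_feature=[1, 4],
        free_raw_data=False
    )
    dtrain.construct()

    # The same call lgb.train() / lgb.cv() make internally with their
    # categorical_feature argument: the later value replaces the earlier one
    # (lightgbm emits a warning that categorical_feature is overridden)
    dtrain.set_categorical_feature([1, 3])
    print(dtrain.categorical_feature)  # [1, 3]
    ```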

    At worst, it can cause an error if the raw data is no longer available.

    import lightgbm as lgb
    from sklearn.datasets import make_regression
    
    X, y = make_regression(n_samples=1_000, n_features=10)
    
    dtrain = lgb.Dataset(
        X,
        label=y,
        categorical_feature=[1, 4],
        free_raw_data=True
    )
    dtrain.construct()
    
    bst = lgb.train(
      params={"objective": "regression"},
      train_set=dtrain,
      categorical_feature=[1, 3]
    )
    # lightgbm.basic.LightGBMError: Cannot set categorical feature after freed raw data,
    # set free_raw_data=False when construct Dataset to avoid this.
    
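    To avoid both cases, the recommended pattern is to declare categorical features once, on the Dataset, and leave categorical_feature='auto' (the default) in lightgbm.train() / lightgbm.cv(). A sketch of that pattern; integer-coding two columns is an assumption here, just to make them behave like real categories:

    ```python
    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=1_000, n_features=10)

    # Integer-code two columns so they behave like real categories
    rng = np.random.default_rng(0)
    X[:, 1] = rng.integers(0, 5, size=X.shape[0])
    X[:, 4] = rng.integers(0, 3, size=X.shape[0])

    # Declare categorical features once, on the Dataset...
    dtrain = lgb.Dataset(X, label=y, categorical_feature=[1, 4])

    # ...and do not repeat them here; categorical_feature stays at 'auto'
    bst = lgb.train(
        params={"objective": "regression", "verbosity": -1},
        train_set=dtrain,
        num_boost_round=5,
    )
    ```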
