python, lightgbm, boosting

pred_leaf in LightGBM


While going through the LightGBM docs I found that predict supports a pred_leaf argument. The docs say

pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.

However, when I run the following:

# data has shape (1, 28)
# gbm was trained with num_boost_round = X

embedding = gbm.predict(data, pred_leaf=True)
embedding.shape  # (1, X)
print(embedding[0, :])  # [29,  2,  8, 26,  2,  2, 16, 18, 25, 30, 16, 25,  0, 17, 15]

I don't understand why the output is a filled array rather than a one-hot vector or a scalar value. The docs say it predicts the leaf index. Can this be used as an "embedding" for another model?

PS: I would have posted this on stats.stackexchange, but 1) the question is specific to LightGBM and 2) there is no lightgbm tag there.


Solution

  • With pred_leaf=True, LightGBM's predict returns an array of shape (nsample, ntrees) containing int32 values.

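    Each entry is the index of the leaf that a given sample (row) lands in for a given tree (column).

    Leaf indices are numbered separately within each tree, so the same number can appear in several columns while referring to a different leaf each time. That is why the output is a filled array rather than a scalar or a one-hot vector: it holds one leaf index per tree.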

    As for its behaviour, this LightGBM prediction option mimics the analogous pred_leaf option in XGBoost (https://xgboost.readthedocs.io/en/latest/python/python_api.html).
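
On the "embedding" part of the question: one common way to feed these indices to another model is to one-hot encode each tree's column separately and concatenate the results. Below is a minimal sketch of that idea in plain Python, using the row printed in the question. The names one_hot_leaves and NUM_LEAVES are hypothetical, and the value 31 is an assumption (it is LightGBM's default num_leaves; use whatever the model was actually trained with).

```python
# Leaf indices for one sample, copied from the question: one entry per tree.
leaf_row = [29, 2, 8, 26, 2, 2, 16, 18, 25, 30, 16, 25, 0, 17, 15]

# Assumption: every tree has at most 31 leaves (LightGBM's default num_leaves).
NUM_LEAVES = 31


def one_hot_leaves(row, num_leaves):
    """Concatenate one one-hot block per tree into a single flat vector."""
    encoded = [0] * (len(row) * num_leaves)
    for tree_idx, leaf_idx in enumerate(row):
        # Offset by the tree's block so that leaf 2 of tree 1 and leaf 2 of
        # tree 4 land in different positions of the output vector.
        encoded[tree_idx * num_leaves + leaf_idx] = 1
    return encoded


features = one_hot_leaves(leaf_row, NUM_LEAVES)
print(len(features))  # 15 trees * 31 leaves = 465
print(sum(features))  # one active position per tree = 15
```

For a full (nsample, ntrees) matrix, a sparse encoder such as scikit-learn's OneHotEncoder applied column-wise should give the same kind of representation without the manual loop.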