While going through the LightGBM docs I found that predict supports a pred_leaf argument. The docs say:
pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
However, when I do the following:
# data is an array of shape (1, 28)
# gbm is a model trained with num_boost_round = X
embedding = gbm.predict(data, pred_leaf=True)
embedding.shape  # (1, X)
print(embedding[0, :]) # [29, 2, 8, 26, 2, 2, 16, 18, 25, 30, 16, 25, 0, 17, 15]
I don't understand why the output is an array of integers rather than a one-hot vector or a single scalar value, given that the docs say it predicts the leaf index. Can this be used as an "embedding" for another model?
PS: I would have posted this on stats.stackexchange, but 1) it is specific to LightGBM and 2) they don't have a lightgbm tag.
The output of LightGBM's predict with the pred_leaf argument set to True is an array of shape (nsample, ntrees) containing int32 values.
Each integer entry in the matrix indicates the predicted leaf index of each sample in each tree.
Leaf indices are numbered independently within each tree, so the same number can appear in many different columns; the same value in two columns refers to two different leaves, one in each tree.
As far as its behaviour is concerned, this LightGBM prediction function mimics the analogous one in XGBoost (https://xgboost.readthedocs.io/en/latest/python/python_api.html).
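On the "embedding" part of the question: one common way to feed these indices to another model is to one-hot encode each tree's leaf index and concatenate the results, which gives a sparse binary vector per sample. The sketch below does this in plain Python on the row printed in the question. The helper name one_hot_leaves is hypothetical, and num_leaves=31 is an assumption (it is LightGBM's default and is consistent with the largest index shown, 30).

```python
def one_hot_leaves(leaf_rows, num_leaves):
    """Concatenate one one-hot block of length num_leaves per tree, for each sample."""
    encoded = []
    for row in leaf_rows:
        vec = [0] * (len(row) * num_leaves)
        for tree_idx, leaf_idx in enumerate(row):
            # Offset by the tree's block so equal leaf numbers in
            # different trees map to different positions.
            vec[tree_idx * num_leaves + leaf_idx] = 1
        encoded.append(vec)
    return encoded

# The leaf indices printed in the question (1 sample, 15 trees).
leaves = [[29, 2, 8, 26, 2, 2, 16, 18, 25, 30, 16, 25, 0, 17, 15]]
features = one_hot_leaves(leaves, num_leaves=31)  # num_leaves is an assumption

print(len(features[0]))  # 465 = 15 trees * 31 leaves
print(sum(features[0]))  # 15, one active position per tree
```

Whether such features actually help a downstream model depends on the task, so treat this as a starting point rather than a recommendation.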