There are two ways to get feature importance in LightGBM: the feature_importance() method on lightgbm.Booster, and the feature_importances_ property on lightgbm.LGBMClassifier. feature_importance() supports two importance types, "split" and "gain". My main question is: which type does feature_importances_ in the scikit-learn API use? "split", "gain", or the "mean decrease in impurity" that scikit-learn estimators default to?
By default, the .feature_importances_ property on a fitted lightgbm.sklearn estimator uses the "split" importance type.
As described in LightGBM's docs (link), the estimators from lightgbm.sklearn take a keyword argument importance_type, which controls what type of importance is returned by the feature_importances_ property.
importance_type (str, optional (default='split')).
The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.
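For example, you can set it once when constructing the estimator instead of changing the attribute later. A minimal sketch, assuming toy data from make_blobs (clf_gain is just an illustrative name):
import lightgbm as lgb
from sklearn.datasets import make_blobs
# toy data, only to demonstrate the keyword argument
X, y = make_blobs(n_samples=1_000, n_features=4, centers=2)
# pass importance_type up front so feature_importances_
# reports total gain instead of split counts
clf_gain = lgb.LGBMClassifier(n_estimators=10, importance_type="gain").fit(X, y)
clf_gain.feature_importances_  # array of floats, one total-gain value per feature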
Here's an example using lightgbm==4.1.0 and Python 3.11.
import lightgbm as lgb
from sklearn.datasets import make_blobs
# generate data
X, y = make_blobs(
    n_samples=1_000,
    n_features=4,
    centers=2
)
# train a model
clf = lgb.LGBMClassifier(
    n_estimators=10,
).fit(X, y)
# .feature_importances_ defaults to "split"
clf.feature_importances_
# array([21, 1, 4, 0], dtype=int32)
# if you set importance_type to "gain", it'll be the
# cumulative gain from all splits involving that feature
clf.importance_type = "gain"
clf.feature_importances_
# array([5.21306300e+03, 3.55271008e-15, 1.27897657e-13, 0.00000000e+00])
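To connect this back to the Booster method from the question: feature_importances_ is computed from the fitted estimator's underlying Booster, which is reachable via the booster_ attribute, so the following calls should produce the same numbers as above.
# same "split" counts as the default feature_importances_
clf.booster_.feature_importance(importance_type="split")
# same total gains as feature_importances_ with importance_type = "gain"
clf.booster_.feature_importance(importance_type="gain")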