
Difference between feature_importances_ and feature_importance() in lightgbm


There are two types of feature importance in LightGBM: feature_importance() on lightgbm.Booster and feature_importances_ on lightgbm.LGBMClassifier. feature_importance() supports two importance types, "split" and "gain". My main question: which one does feature_importances_ use in the scikit-learn API? "split", "gain", or the "mean decrease in impurity" that scikit-learn's own estimators default to?
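
For reference, here is a minimal sketch of the Booster API mentioned above (the training parameters are illustrative):

    import lightgbm as lgb
    from sklearn.datasets import make_blobs

    X, y = make_blobs(n_samples=1_000, n_features=4, centers=2)

    # the Booster API takes the importance type as an explicit argument
    bst = lgb.train({"objective": "binary"}, lgb.Dataset(X, label=y), num_boost_round=10)
    bst.feature_importance(importance_type="split")  # times each feature is used in a split
    bst.feature_importance(importance_type="gain")   # total gain of splits using each feature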


Solution

  • By default, the .feature_importances_ property on a fitted lightgbm.sklearn estimator uses the "split" importance type.

    As described in LightGBM's documentation, the estimators from lightgbm.sklearn take a keyword argument importance_type which controls what type of importance is returned by the feature_importances_ property.

    importance_type (str, optional (default='split')).

    The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.

    Here's an example using lightgbm==4.1.0 and Python 3.11.

    import lightgbm as lgb
    from sklearn.datasets import make_blobs
    
    # generate a toy two-class dataset (unseeded, so exact numbers vary between runs)
    X, y = make_blobs(
        n_samples=1_000,
        n_features=4,
        centers=2
    )
    
    # train a model
    clf = lgb.LGBMClassifier(
        n_estimators=10,
    ).fit(X, y)
    
    # .feature_importances_ defaults to "split"
    clf.feature_importances_
    # array([21,  1,  4,  0], dtype=int32)
    
    # if you set importance_type to "gain", it'll be the
    # cumulative gain from all splits involving that feature
    clf.importance_type = "gain"
    clf.feature_importances_
    # array([5.21306300e+03, 3.55271008e-15, 1.27897657e-13, 0.00000000e+00])
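
    One way to sanity-check this: the scikit-learn property delegates to the underlying Booster, which a fitted estimator exposes as booster_, so the two should report the same numbers. A minimal sketch, continuing from the snippet above:

    import numpy as np

    # feature_importances_ should match the Booster's report for
    # whatever importance_type is currently set on the estimator
    np.array_equal(
        clf.feature_importances_,
        clf.booster_.feature_importance(importance_type=clf.importance_type),
    )
    # expected: True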