Tags: python, plot, lightgbm

lightgbm.create_tree_digraph() only shows a tree with one feature instead of all features


I have built a LightGBM model for classification purposes. I would like to build a tree from my LightGBM model using lightgbm.create_tree_digraph. But instead of showing all features in a tree, it only shows the Purpose_Of_Loan feature. I would really like to see all of the features visualized in a tree. There are a total of 12 features in the training set. Any help would be greatly appreciated.

Features:

Loan_Amount_Requested      float64
Length_Employed            float64
Home_Owner                   int64
Annual_Income              float64
Income_Verified              int64
Purpose_Of_Loan              int64
Debt_To_Income             float64
Inquiries_Last_6Mo           int64
Months_Since_Deliquency    float64
Number_Open_Accounts         int64
Total_Accounts               int64
Gender                       int64

Note: Home_Owner, Income_Verified, Purpose_Of_Loan are categorical features.

classifier code:

from lightgbm import LGBMClassifier

clf = LGBMClassifier(nthread=4,
                     n_estimators=100,
                     learning_rate=0.05,
                     bagging_fraction=1,
                     feature_fraction=0.1,
                     lambda_l1=5,
                     lambda_l2=0,
                     max_depth=5,
                     min_child_weight=5,
                     min_split_gain=0.001,
                     is_unbalance=True,
                     num_leaves=36)

clf.fit(X, y)

plotting tree code:

import lightgbm

viz = lightgbm.create_tree_digraph(clf.booster_)
viz

Output:

[image: rendered digraph of a single tree in which every split uses Purpose_Of_Loan]


Solution

  • I would really like to see all of the features visualized in a tree

    LightGBM does not make any guarantees that every feature will be used.

    While growing each tree, LightGBM adds splits, i.e. (feature, threshold) combinations, one at a time, choosing the split which offers the best "gain" (change in the objective function). This is how multiple splits on one feature can be chosen within a single tree, as in your example, and how features that are not very informative might never be chosen for any splits (see the sketch at the end of this answer for one way to inspect this per tree).

    After training, you can use Booster.feature_importance(importance_type="split") to check how often each feature was chosen for splits. You can pass keyword argument iteration to that method to view this information for only some iterations, e.g. iteration=5 to see how often each feature was chosen in the first 6 trees.

    Consider the following example (using lightgbm==3.3.1, in Python 3.8).

    import lightgbm as lgb
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    
    X, y = load_breast_cancer(return_X_y=True)
    data = lgb.Dataset(X, label=y)
    
    # train model
    bst = lgb.train(
        params={
            "objective": "binary",
            "verbose": -1
        },
        train_set=data,
        num_boost_round=10
    )
    
    # compute importances
    importance_df = (
        pd.DataFrame({
            'feature_name': bst.feature_name(),
            'importance_split': bst.feature_importance(importance_type='split', iteration=-1),
        })
        .sort_values('importance_split', ascending=False)
        .reset_index(drop=True)
    )
    print(importance_df)
    

    This code produces the following output.

       feature_name  importance_split
    0     Column_21                24
    1     Column_28                19
    2     Column_27                19
    3      Column_1                15
    4     Column_10                14
    

    That says "feature Column_21 was chosen for 24 splits, feature Column_28 was chosen for 19 splits, etc.".
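
    If you also want to see, for any individual tree, exactly which features that particular tree splits on, here is a minimal sketch. It assumes the same lightgbm==3.3.1 setup as the example above, and split_features() is a small helper written here purely for illustration (it is not part of LightGBM); it walks the JSON model dump returned by Booster.dump_model().

    import lightgbm as lgb
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    # same small model as in the example above
    bst = lgb.train(
        params={
            "objective": "binary",
            "verbose": -1
        },
        train_set=lgb.Dataset(X, label=y),
        num_boost_round=10
    )

    def split_features(node, found=None):
        # illustrative helper: recursively collect the feature indices
        # used for splits in one tree of the JSON model dump
        if found is None:
            found = set()
        if "split_feature" in node:
            found.add(node["split_feature"])
            split_features(node["left_child"], found)
            split_features(node["right_child"], found)
        return found

    feature_names = bst.feature_name()
    for tree in bst.dump_model()["tree_info"]:
        used = sorted(feature_names[i] for i in split_features(tree["tree_structure"]))
        print(f"tree {tree['tree_index']}: {used}")

    Keep in mind that lightgbm.create_tree_digraph() renders only one of these trees per call (its tree_index argument defaults to 0), so a plot in which every split uses Purpose_Of_Loan describes that single tree, not the whole model.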