I have built a LightGBM model for classification. I would like to visualize a tree from the model using lightgbm.create_tree_digraph. But instead of showing all of the features in a tree, it only shows the Purpose_Of_Loan feature. I would really like to see all of the features visualized in a tree. There are a total of 12 features in the training set. Any help would be greatly appreciated.
Features:
Loan_Amount_Requested float64
Length_Employed float64
Home_Owner int64
Annual_Income float64
Income_Verified int64
Purpose_Of_Loan int64
Debt_To_Income float64
Inquiries_Last_6Mo int64
Months_Since_Deliquency float64
Number_Open_Accounts int64
Total_Accounts int64
Gender int64
Note: Home_Owner, Income_Verified, and Purpose_Of_Loan are categorical features.
Classifier code:

from lightgbm import LGBMClassifier

clf = LGBMClassifier(
    nthread=4,
    n_estimators=100,
    learning_rate=0.05,
    bagging_fraction=1,
    feature_fraction=0.1,
    lambda_l1=5,
    lambda_l2=0,
    max_depth=5,
    min_child_weight=5,
    min_split_gain=0.001,
    is_unbalance=True,
    num_leaves=36
)
clf.fit(X, y)
Plotting tree code:

import lightgbm

viz = lightgbm.create_tree_digraph(clf.booster_)
viz
Output: [image of the rendered tree, in which every split uses Purpose_Of_Loan]
LightGBM does not make any guarantees that every feature will be used. While growing each tree, LightGBM adds splits ((feature, threshold) combinations) one at a time, choosing the split that offers the best "gain" (improvement in the objective function). This is how multiple splits from one feature can be chosen in a single tree, as in your example, and how features that are not very informative might never be chosen for any splits.
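Note also that create_tree_digraph draws one tree at a time (its tree_index argument defaults to 0), so your plot shows only the first tree in the model; other trees may well split on other features. As a quick check, here is a minimal sketch (assuming clf is the fitted classifier from your question) that uses Booster.dump_model() to list which features each tree actually splits on:

# list the features each tree splits on, using the model dump
# (assumes `clf` is the fitted LGBMClassifier from the question)
model_dict = clf.booster_.dump_model()

def features_used(node, found=None):
    # recursively collect split feature indices;
    # leaf nodes have no "split_feature" key, so recursion stops there
    if found is None:
        found = set()
    if "split_feature" in node:
        found.add(node["split_feature"])
        features_used(node["left_child"], found)
        features_used(node["right_child"], found)
    return found

for tree in model_dict["tree_info"]:
    print(tree["tree_index"], sorted(features_used(tree["tree_structure"])))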
After training, you can use Booster.feature_importance(importance_type="split") to check how often each feature was chosen for splits. You can pass the keyword argument iteration to that method to view this information for only some iterations, e.g. iteration=5 to see how often each feature was chosen in the first 5 trees.
Consider the following example (using lightgbm==3.3.1, in Python 3.8).
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)
# train model
bst = lgb.train(
    params={
        "objective": "binary",
        "verbose": -1
    },
    train_set=data,
    num_boost_round=10
)
# compute importances
importance_df = (
    pd.DataFrame({
        'feature_name': bst.feature_name(),
        'importance_split': bst.feature_importance(importance_type='split', iteration=-1),
    })
    .sort_values('importance_split', ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
This code produces the following output.
  feature_name  importance_split
0    Column_21                24
1    Column_28                19
2    Column_27                19
3     Column_1                15
4    Column_10                14
That says "feature Column_21
was chosen for 24 splits, feature Column_28
was chosen for 19 splits, etc.".
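The generic Column_21-style names appear because this example trains on a raw numpy array. To have the importance table (and tree plots) show real feature names like Purpose_Of_Loan, you can pass feature_name to lgb.Dataset (or train on a pandas DataFrame). Here is a minimal sketch of that, with hypothetical feat_0-style names standing in for your own column names; it also shows the iteration keyword argument restricting the split counts to the first 5 trees:

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# hypothetical names for illustration; substitute your own column names
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

data = lgb.Dataset(X, label=y, feature_name=feature_names)
bst = lgb.train(
    params={"objective": "binary", "verbose": -1},
    train_set=data,
    num_boost_round=10
)

# feature_name() now returns the supplied names instead of Column_N
print(bst.feature_name()[:3])

# split counts computed over only the first 5 trees
print(bst.feature_importance(importance_type="split", iteration=5))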