I have an XGBoost model, and the Gini feature importance I get when I pass in standardized data vs. non-standardized data is completely different. I thought XGBoost was immune to standardization. Can you please help me understand?
from xgboost import XGBClassifier
import pandas as pd
import matplotlib.pyplot as plt

def train_model(x_train, y_train, x_test, y_test):
    # Fit an XGBoost classifier with fixed hyperparameters
    xg = XGBClassifier(n_estimators=350, subsample=1.0, scale_pos_weight=25, min_child_weight=8, max_depth=21,
                       learning_rate=0.1, gamma=0.2, colsample_bytree=0.5)
    xg.fit(x_train, y_train)
    return xg

def find_feat_imp(model, cols):
    # Plot the 20 largest built-in feature importances of the fitted model
    feat_importance = pd.Series(model.feature_importances_, index=cols)
    sort_feat_imp = feat_importance.sort_values(ascending=False)
    sort_feat_imp[:20].plot(kind='bar')
    plt.show()

x_train, y_train, x_test, y_test = read_data()
xg = train_model(x_train, y_train, x_test, y_test)
find_feat_imp(xg, x_train.columns)
The output for the standardized data is (https://i.sstatic.net/i6sW0.png) and the output for the raw data is (https://i.sstatic.net/If5u8.png). The important features are completely different. I was expecting the feature importances to be the same, since XGBoost is not supposed to be affected by standardizing the data. Can you please help me understand?
You haven't set a random seed, so the models may be different just due to random chance. In particular, because you set colsample_bytree < 1, you get a random effect in which columns are available per tree. Since the first trees are usually the most impactful, if a feature happens to be left out of them, its importance score will suffer. (Note however that xgboost supports several feature importance types; this effect might be more or less noticeable depending on which type you are using.)
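A quick way to convince yourself of this is to fix the seed and turn column subsampling off: standardization is a monotone rescaling, so with the randomness removed the trees (and hence the importances) should come out essentially the same, while re-introducing column subsampling with different seeds makes them diverge. The sketch below is only illustrative: the make_classification data, the StandardScaler step, and the importances helper are stand-ins I've introduced, not part of your pipeline.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Synthetic stand-in data (hypothetical; substitute your own x_train / y_train)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_std = StandardScaler().fit_transform(X)

def importances(features, target, **params):
    # Helper (not from the original post): fit a small model, return its importances
    model = XGBClassifier(n_estimators=100, max_depth=4, **params)
    model.fit(features, target)
    return model.feature_importances_

# Deterministic setting: fixed seed, no column subsampling.
# Standardization only rescales the features, so the splits (and importances)
# should be essentially identical, up to floating point / binning effects.
raw = importances(X, y, colsample_bytree=1.0, random_state=0)
std = importances(X_std, y, colsample_bytree=1.0, random_state=0)
print(np.max(np.abs(raw - std)))      # expected to be close to 0

# With colsample_bytree < 1, which columns each tree sees depends on the seed,
# so two fits with different seeds (mimicking two un-seeded runs) can rank
# features quite differently even on the exact same data.
imp_a = importances(X, y, colsample_bytree=0.5, random_state=1)
imp_b = importances(X, y, colsample_bytree=0.5, random_state=2)
print(np.max(np.abs(imp_a - imp_b)))  # typically noticeably larger than above

# The wrapper's feature_importances_ reflects one importance type; you can pull
# other types directly from the underlying booster to see how much they differ.
booster = XGBClassifier(n_estimators=100, random_state=0).fit(X, y).get_booster()
print(booster.get_score(importance_type='gain'))
print(booster.get_score(importance_type='weight'))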