Tags: python, data-science, xgboost

XGBoost Gini Feature Importance - standardized data vs raw data


I have an XGBoost model, and the Gini feature importances I get when I pass in standardized data vs. non-standardized data are completely different. I thought XGBoost was immune to standardization. Can you please help me understand?

from xgboost import XGBClassifier
import pandas as pd
import matplotlib.pyplot as plt


def train_model(x_train, y_train, x_test, y_test):
    xg = XGBClassifier(n_estimators=350, subsample=1.0, scale_pos_weight=25, min_child_weight=8,
                       max_depth=21, learning_rate=0.1, gamma=0.2, colsample_bytree=0.5)
    xg.fit(x_train, y_train)
    return xg

def find_feat_imp(model, cols):
    feat_importance = pd.Series(model.feature_importances_, index=cols)
    sort_feat_imp = feat_importance.sort_values(ascending=False)
    # plot the 20 most important features
    sort_feat_imp[:20].plot(kind='bar')
    plt.show()

# read_data() is the asker's own data-loading helper (not shown)
x_train, y_train, x_test, y_test = read_data()
xg = train_model(x_train, y_train, x_test, y_test)
find_feat_imp(xg, x_train.columns)

The output for the standardized data is (https://i.sstatic.net/i6sW0.png) and the output for the raw data is (https://i.sstatic.net/If5u8.png).

The important features are completely different.

I was expecting the feature importances to be the same, since XGBoost is not affected by standardizing the data.


Solution

  • You haven't set a random seed, so the two models can differ simply by chance. In particular, because you set colsample_bytree < 1, each tree is grown on a random subset of the columns. Since the first trees are usually the most impactful, a feature that happens to be left out of them will see its importance score suffer. (Note, however, that XGBoost supports several feature importance types; this effect may be more or less noticeable depending on which type you are using.) A quick way to check is to pin the seed and re-run, as in the sketch below.
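
Here is a minimal sketch of that check. It is not the asker's pipeline: it uses synthetic data from make_classification as a stand-in for read_data(), fewer trees for speed, and makes importance_type="gain" explicit. With random_state fixed, the colsample_bytree column draws are identical across runs, and since standardization is a monotonic per-feature transform, the raw and standardized importances should come out (near-)identical; removing random_state reproduces the discrepancy in the question.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Synthetic stand-in for the asker's read_data(); any tabular dataset works.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_raw = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
X_std = pd.DataFrame(StandardScaler().fit_transform(X_raw), columns=X_raw.columns)

def gain_importances(X, y):
    # Fixing random_state pins the colsample_bytree draws, so both runs
    # grow their trees on the same column subsets.
    xg = XGBClassifier(n_estimators=100, colsample_bytree=0.5,
                       importance_type="gain", random_state=0)
    xg.fit(X, y)
    return pd.Series(xg.feature_importances_, index=X.columns)

# Standardizing only rescales the split thresholds, not the tree structure,
# so with the seed fixed the two columns below should match.
print(pd.concat({"raw": gain_importances(X_raw, y),
                 "std": gain_importances(X_std, y)}, axis=1))

Swapping importance_type between "gain", "weight", and "cover" in the constructor also shows how much the ranking depends on which importance definition you use.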