
XGBoost plot_importance cannot show feature names


I used plot_importance to show the most important variables, but some of the variables are categorical, so I did some transformation first. After transforming the variables, the plot no longer shows the feature names. I attached my code and the plot.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_importance

dataset = data.values
X = dataset[1:100, 0:-2]
predictors = dataset[1:100, -1]

X = X.astype(str)
encoded_x = None
for i in range(0, X.shape[1]):
    # Integer-encode column i, then expand it into one-hot columns
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False)
    feature = onehot_encoder.fit_transform(feature)
    # Stack the one-hot columns of each feature side by side
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = np.concatenate((encoded_x, feature), axis=1)
print("X shape: ", encoded_x.shape)


response='Default'
#predictors=list(data.columns.values[:-1])



# Randomly split indexes
X_train, X_test, y_train, y_test = train_test_split(
    encoded_x, predictors, train_size=0.7, random_state=5)



model = XGBClassifier()
model.fit(X_train, y_train)


plot_importance(model)
plt.show()

[Plot: feature importances shown without the feature names: https://i.sstatic.net/M9qgY.png]

Solution

  • This is the expected behaviour: sklearn's OneHotEncoder.transform() returns a 2D NumPy array instead of the input pd.DataFrame (I assume that is the type of your dataset), so the column names are lost along the way. It is not a bug, but a feature. There does not appear to be a way to pass feature names manually in the sklearn API, although it is possible to set them when creating the xgb.DMatrix in the native training API (see the second sketch below).

    However, your problem is easily solvable with pd.get_dummies() instead of the LabelEncoder + OneHotEncoder combination you implemented. That combination can be useful if you also need to encode a test set consistently, but it requires extra tricks; otherwise I would advise in favour of pd.get_dummies(). A minimal sketch follows.
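
    A minimal sketch of the pd.get_dummies() route, assuming (as in your code) that data is a pd.DataFrame whose last column is the target; adjust the slicing to your actual layout:

        import pandas as pd
        import matplotlib.pyplot as plt
        from sklearn.model_selection import train_test_split
        from xgboost import XGBClassifier, plot_importance

        # get_dummies one-hot encodes the object/categorical columns and
        # names the new columns "<column>_<value>"; XGBoost picks these
        # names up from the DataFrame, so plot_importance can display them.
        X = pd.get_dummies(data.iloc[:, :-1])
        y = data.iloc[:, -1]

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.7, random_state=5)

        model = XGBClassifier()
        model.fit(X_train, y_train)

        plot_importance(model)
        plt.show()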
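
    If you prefer to keep the NumPy-based encoding, the native training API lets you attach names explicitly via the feature_names argument of xgb.DMatrix. A sketch, assuming X_train and y_train are the numeric arrays from your pipeline and the target is binary; the generated names are placeholders (in practice you would build them from label_encoder.classes_ while encoding):

        import xgboost as xgb
        import matplotlib.pyplot as plt

        # One name per column of the one-hot matrix (placeholder names here)
        feature_names = ["feature_%d" % i for i in range(X_train.shape[1])]

        dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
        params = {"objective": "binary:logistic"}  # assumed binary target
        booster = xgb.train(params, dtrain, num_boost_round=100)

        xgb.plot_importance(booster)
        plt.show()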