I have been looking into several feature selection methods and found one that uses XGBoost feature importance, via the following link (XGBoost feature importance and selection). I implemented the method for my case, and the results were the following:
So my question is: for this case, how can I select the result with the highest accuracy and a low number of features [n]? [The code can be found in the link]
Edit 1:
Thanks to @Mihai Petre, I managed to get it to work with the code in his answer. I have another question: say I ran the code from the link and got the following:
Feature Importance results = [29.205832 5.0182242 0. 0. 0. 6.7736177 16.704327 18.75632 9.529003 14.012676 0. ]
Features = [ 0 7 6 9 8 5 1 10 4 3 2]
How can I remove the features that have zero feature importance and keep only the features with nonzero importance values?
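A minimal sketch of the kind of filtering I mean, assuming model is the fitted XGBoost model and X_train is a NumPy array (the variable names are illustrative, not from the linked code):

import numpy as np

# importances reported by the fitted XGBoost model
importances = np.array(model.feature_importances_)

# indices of the features with nonzero importance
kept_indices = np.flatnonzero(importances > 0)
print("Kept feature indices:", kept_indices)
print("Kept importances:", importances[kept_indices])

# keep only those columns of the training data (assumes X_train is a NumPy array)
X_train_reduced = X_train[:, kept_indices]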
Side Questions:
Ok, so what the guy in your link is doing with
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
is to create a sorted array of thresholds and then train an XGBoost model for every element of that array.
From your question, I'm thinking you want to select only the 6th case: the one with the lowest number of features and the highest accuracy. For this case, you'd want to do something like this:
# pick the 6th threshold (index 5) from the sorted thresholds array
selection = SelectFromModel(model, threshold=thresholds[5], prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresholds[5], select_X_train.shape[1], accuracy * 100.0))
If you want to automate the whole thing, you'd want to find, inside that for loop, the minimum n for which the accuracy is at its maximum; it would look more or less like this:
from numpy import sort
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

n_min = X_train.shape[1]  # start from the maximum number of features
acc_max = 0.0
thresholds = sort(model.feature_importances_)
obj_thresh = thresholds[0]
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    # keep the threshold with the best accuracy; on ties, prefer fewer features
    if (accuracy > acc_max) or (accuracy == acc_max and select_X_train.shape[1] < n_min):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh

# retrain once with the winning threshold
selection = SelectFromModel(model, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy * 100.0))
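If you also want to recover which feature indices survive the winning threshold, SelectFromModel exposes scikit-learn's get_support method; a short sketch, continuing from the code above:

# indices of the features kept by the final selector
kept_indices = selection.get_support(indices=True)
print("Selected feature indices:", kept_indices)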