Tags: python, machine-learning, xgboost, feature-selection, feature-engineering

How to get the highest accuracy with a low number of selected features using XGBoost?


I have been looking into several feature selection methods and came across feature selection with XGBoost in the following link (XGBoost feature importance and selection). I implemented the method for my case, and the results were the following:

  • Thresh= 0.000, n= 11, Accuracy: 55.56%
  • Thresh= 0.000, n= 11, Accuracy: 55.56%
  • Thresh= 0.000, n= 11, Accuracy: 55.56%
  • Thresh= 0.000, n= 11, Accuracy: 55.56%
  • Thresh= 0.097, n= 7, Accuracy: 55.56%
  • Thresh= 0.105, n= 6, Accuracy: 55.56%
  • Thresh= 0.110, n= 5, Accuracy: 50.00%
  • Thresh= 0.114, n= 4, Accuracy: 50.00%
  • Thresh= 0.169, n= 3, Accuracy: 44.44%
  • Thresh= 0.177, n= 2, Accuracy: 38.89%
  • Thresh= 0.228, n= 1, Accuracy: 33.33%

So, my question is the following: for this case, how can I get the highest accuracy with a low number of features [n]? [The code can be found in the link.]

Edit 1:

Thanks to @Mihai Petre, I managed to get it to work with the code in his answer. I have another question: say I ran the code from the link and got the following:

Feature Importance results = [29.205832   5.0182242  0.         0.         0. 6.7736177 16.704327  18.75632    9.529003  14.012676   0.       ]
Features = [ 0  7  6  9  8  5  1 10  4  3  2]
  • Thresh= 0.000, n= 11, Accuracy: 38.89%
  • Thresh= 0.000, n= 11, Accuracy: 38.89%
  • Thresh= 0.000, n= 11, Accuracy: 38.89%
  • Thresh= 0.000, n= 11, Accuracy: 38.89%
  • Thresh= 0.050, n= 7, Accuracy: 38.89%
  • Thresh= 0.068, n= 6, Accuracy: 38.89%
  • Thresh= 0.095, n= 5, Accuracy: 33.33%
  • Thresh= 0.140, n= 4, Accuracy: 38.89%
  • Thresh= 0.167, n= 3, Accuracy: 33.33%
  • Thresh= 0.188, n= 2, Accuracy: 38.89%
  • Thresh= 0.292, n= 1, Accuracy: 38.89%

How can I remove the features that gave zero feature importance and keep only the features that have nonzero importance values?

Side Questions:

  1. I am trying to find the best feature selection method for a specific classification model, i.e. the features that give the highest accuracy with that model. Say, for example, I use a KNN classifier and would like to find the features that give the highest accuracy. What feature selection method would be appropriate to use?
  2. When implementing multiple classification models, is it best to do feature selection for each classification model separately, or to do feature selection once and then use the selected features with the multiple classification models?

Solution

  • OK, so what the author of your link is doing with

    from numpy import sort
    from sklearn.feature_selection import SelectFromModel
    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    # sort importances ascending so each value can serve as a cutoff
    thresholds = sort(model.feature_importances_)
    for thresh in thresholds:
        # select features whose importance is >= thresh
        selection = SelectFromModel(model, threshold=thresh, prefit=True)
        select_X_train = selection.transform(X_train)
        # train a fresh model on the reduced feature set
        selection_model = XGBClassifier()
        selection_model.fit(select_X_train, y_train)
        # evaluate on the correspondingly reduced test set
        select_X_test = selection.transform(X_test)
        predictions = selection_model.predict(select_X_test)
        accuracy = accuracy_score(y_test, predictions)
        print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
    

    is to create a sorted array of thresholds and then train an XGBoost model for every element of that thresholds array.

    From your question, I think you want to select only the 6th case, the one with the lowest number of features and the highest accuracy. For that case, you'd want to do something like this:

    # note: the sorted array from the loop above is called thresholds,
    # so index it as thresholds[5] to pick the 6th case
    selection = SelectFromModel(model, threshold=thresholds[5], prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresholds[5], select_X_train.shape[1], accuracy * 100.0))
    

    If you want to automate the whole thing, you'd want to find, inside that for loop, the minimum n for which the accuracy is at its maximum, and it would look more or less like this:

    n_min = X_train.shape[1]  # start from the full feature count
    acc_max = 0
    thresholds = sort(model.feature_importances_)
    obj_thresh = thresholds[0]
    for thresh in thresholds:
        selection = SelectFromModel(model, threshold=thresh, prefit=True)
        select_X_train = selection.transform(X_train)
        selection_model = XGBClassifier()
        selection_model.fit(select_X_train, y_train)
        select_X_test = selection.transform(X_test)
        predictions = selection_model.predict(select_X_test)
        accuracy = accuracy_score(y_test, predictions)
        # keep the best accuracy seen so far; on ties, prefer fewer features
        if (accuracy > acc_max) or (accuracy == acc_max and select_X_train.shape[1] < n_min):
            n_min = select_X_train.shape[1]
            acc_max = accuracy
            obj_thresh = thresh
    
    # refit once with the winning threshold and report the final result
    selection = SelectFromModel(model, threshold=obj_thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy * 100.0))