Tags: scikit-learn, decision-tree, feature-selection

How to return the features used in a decision tree created by DecisionTreeClassifier in sklearn


I want to do feature selection on my data set with CART and C4.5 decision trees: fit a decision tree on the data set, then extract the features the algorithm actually used to build the tree. So I need to get back the features used in the fitted tree. I use `DecisionTreeClassifier` from the `sklearn.tree` module. I need a method or function that returns the features used in the fitted tree, so I can treat them as the most important features in my main algorithm.


Solution

  • You can approach the problem as shown below:

    I assume you have the train (x_train, y_train) and test (x_test, y_test) sets.

    from pandas import DataFrame
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    from sklearn.metrics import f1_score
    
    # fit the decision tree on the training set
    tree_clf1 = DecisionTreeClassifier().fit(x_train, y_train)
    
    y_pred = tree_clf1.predict(x_test)
    
    # evaluate the classifier on the test set
    print(confusion_matrix(y_test, y_pred))
    print("\n\nAccuracy: {:,.2f}%".format(accuracy_score(y_test, y_pred)*100))
    print("Precision: {:,.2f}%".format(precision_score(y_test, y_pred)*100))
    print("Recall: {:,.2f}%".format(recall_score(y_test, y_pred)*100))
    print("F1-Score: {:,.2f}%".format(f1_score(y_test, y_pred)*100))
    
    # rank the features by their importance in the fitted tree
    feature_importances = DataFrame(tree_clf1.feature_importances_,
                                    index=x_train.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
    
    print(feature_importances)
    

    The output is a table of your features sorted by importance in descending order, which shows which features matter most for your classification. Features the tree never splits on have an importance of 0, so the features actually used in the tree are exactly those with a non-zero importance.
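    If you want the used features directly rather than via importances, here is a minimal sketch using a fitted tree's `tree_.feature` array, which stores the feature index tested at each internal node (leaves are marked with `sklearn.tree._tree.TREE_UNDEFINED`). The iris data set stands in for your own `x_train`/`y_train`:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, _tree

    # toy data as a stand-in for your own training set
    iris = load_iris()
    X, y = iris.data, iris.target
    feature_names = iris.feature_names

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    # tree_.feature holds the feature index used at each node;
    # leaf nodes are marked with _tree.TREE_UNDEFINED
    used_idx = sorted(set(
        i for i in clf.tree_.feature if i != _tree.TREE_UNDEFINED
    ))
    used_features = [feature_names[i] for i in used_idx]
    print(used_features)
    ```

    `used_features` is then the list of column names that appear as split nodes somewhere in the tree, which is exactly the set the question asks for.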
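    For the follow-on step of feeding only the selected features to another algorithm, one possible sketch uses scikit-learn's `SelectFromModel`, which keeps the features whose importance exceeds a threshold (here the mean importance; the iris data again stands in for your own set):

    ```python
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # fit a tree and keep features with above-average importance
    selector = SelectFromModel(
        DecisionTreeClassifier(random_state=0), threshold="mean"
    ).fit(X, y)

    X_reduced = selector.transform(X)   # data restricted to selected columns
    print(selector.get_support())       # boolean mask of kept features
    print(X_reduced.shape)
    ```

    `X_reduced` can then be passed straight into your main algorithm in place of the full feature matrix.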