python machine-learning scikit-learn knn

How to find the most contributing feature for any scikit-learn classifier (e.g. DecisionTreeClassifier, KNN)


I have fitted a KNN classifier on a data set. I would like to know which feature contributes most to the model, and which feature contributes most to a prediction.


Solution

  • To gain qualitative insight into which feature has the greatest impact on classification, you could run n_feats classifications, each using a single feature (n_feats is the dimension of the feature vector), like this:

    import numpy as np
    from sklearn import datasets
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score
    
    iris = datasets.load_iris()
    
    clf = KNeighborsClassifier()
    
    y =  iris.target
    n_feats = iris.data.shape[1]
    
    print('Feature  Accuracy')
    for i in range(n_feats):
        X = iris.data[:, i].reshape(-1, 1)
        scores = cross_val_score(clf, X, y, cv=3)
        print(f'{i}        {scores.mean():g}')
    

    Output:

    Feature  Accuracy
    0        0.692402
    1        0.518382
    2        0.95384
    3        0.95384
    

    These results suggest that classification is dominated by features 2 and 3.

    Alternatively, you could replace X = iris.data[:, i].reshape(-1, 1) in the code above with:

        X_head = np.atleast_2d(iris.data[:, 0:i])
        X_tail = np.atleast_2d(iris.data[:, i+1:])
        X = np.hstack((X_head, X_tail))
    

    In this case you are performing n_feats classifications as well. The difference is that the feature vector used in the i-th classification consists of all the features except the i-th.

    Sample run:

    Feature  Accuracy
    0        0.973856
    1        0.96732
    2        0.946895
    3        0.959967
    

    These results show that the classifier yields its lowest accuracy when you remove the third feature (index 2), although the differences between runs are small. This is consistent with the results of the first approach.
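Both approaches can be combined into one self-contained script. The following is a minimal sketch rather than a definitive implementation: the helper name `feature_ablation_scores` is hypothetical, and `np.delete` is used in place of the `np.hstack` construction above to drop the i-th column, which should select the same features. Exact accuracies may differ from the sample runs depending on the scikit-learn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score


def feature_ablation_scores(X, y, clf, cv=3):
    """Hypothetical helper: for each feature i, return the mean
    cross-validated accuracy using only feature i (single) and
    using every feature except i (dropped)."""
    n_feats = X.shape[1]
    single = np.empty(n_feats)
    dropped = np.empty(n_feats)
    for i in range(n_feats):
        # Only the i-th feature; indexing with [i] keeps the array 2-D.
        single[i] = cross_val_score(clf, X[:, [i]], y, cv=cv).mean()
        # All features except the i-th.
        X_without_i = np.delete(X, i, axis=1)
        dropped[i] = cross_val_score(clf, X_without_i, y, cv=cv).mean()
    return single, dropped


iris = datasets.load_iris()
single, dropped = feature_ablation_scores(
    iris.data, iris.target, KNeighborsClassifier()
)

print('Feature  Alone     Dropped')
for i, (s, d) in enumerate(zip(single, dropped)):
    print(f'{i}        {s:.4f}    {d:.4f}')
```

A feature that scores high in the first column and causes a low score in the second is a likely candidate for the most contributing feature. Because the helper only relies on `cross_val_score`, the same loop should work with other estimators such as `DecisionTreeClassifier`.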