python machine-learning scikit-learn knn

How to get the most contributing feature in any sklearn classifier (for example DecisionTreeClassifier, KNN, etc.)

I have tried a model on a data set using a KNN classifier, and I would like to know which feature contributes most to the model and which contributes most to the prediction.

Solution

To gain qualitative insight into which feature has the greatest impact on classification, you could perform n_feats classifications, each using a single feature (n_feats is the dimension of the feature vector), like this:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

clf = KNeighborsClassifier()

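y = iris.target
n_feats = iris.data.shape[1]

print('Feature  Accuracy')
for i in range(n_feats):
    # Keep only the i-th feature, as a 2-D column of shape (n_samples, 1)
    X = iris.data[:, i].reshape(-1, 1)
    # Mean 3-fold cross-validated accuracy using this feature alone
    scores = cross_val_score(clf, X, y, cv=3)
    print(f'{i}        {scores.mean():g}')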

Output:

Feature  Accuracy
0        0.692402
1        0.518382
2        0.95384
3        0.95384

These results suggest that classification is dominated by features 2 and 3 (petal length and petal width in the iris data set).
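Since the question asks about any classifier, the same single-feature loop can be wrapped in a small helper and reused. The following is only a sketch: the function name `single_feature_scores` and the `ranking` dictionary are made up for illustration, and `random_state=0` is set on the tree just to make runs repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score


def single_feature_scores(clf, X, y, cv=3):
    """Mean cross-validated accuracy of clf trained on each feature alone.

    Hypothetical helper, not part of scikit-learn.
    """
    return np.array([
        # X[:, [i]] keeps the column 2-D, shape (n_samples, 1)
        cross_val_score(clf, X[:, [i]], y, cv=cv).mean()
        for i in range(X.shape[1])
    ])


iris = datasets.load_iris()
classifiers = {
    'knn': KNeighborsClassifier(),
    'tree': DecisionTreeClassifier(random_state=0),
}

ranking = {}
for name, clf in classifiers.items():
    scores = single_feature_scores(clf, iris.data, iris.target)
    # Feature indices sorted from highest to lowest single-feature accuracy
    ranking[name] = np.argsort(scores)[::-1]
    print(name, scores.round(3), ranking[name])
```

The exact accuracies depend on the scikit-learn version and on how the folds are split, so treat the ranking as a qualitative hint rather than a precise measurement.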

You could follow an alternative approach by replacing X = iris.data[:, i].reshape(-1, 1) in the first listing above with:

    X_head = np.atleast_2d(iris.data[:, 0:i])
    X_tail = np.atleast_2d(iris.data[:, i+1:])
    X = np.hstack((X_head, X_tail))

In this case you are performing n_feats classifications as well. The difference is that the feature vector used in the i-th classification is made up of all the features except the i-th.

Sample run:

Feature  Accuracy
0        0.973856
1        0.96732
2        0.946895
3        0.959967

These results show that the classifier yields its lowest accuracy when you remove the third feature (index 2), which is consistent with the results obtained through the first approach. The differences between features are small here, though, so they should be read as a qualitative indication only.
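For reference, the leave-one-feature-out variant can also be written as a self-contained script. This sketch uses `np.delete` instead of stacking the head and tail slices, and it additionally computes a baseline with all features so that the change in accuracy is visible; the baseline comparison and the names `baseline` and `drops` are additions for illustration, not part of the code above.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X_full, y = iris.data, iris.target
clf = KNeighborsClassifier()

# Reference accuracy with every feature present
baseline = cross_val_score(clf, X_full, y, cv=3).mean()

print('Dropped  Accuracy  Change')
drops = []
for i in range(X_full.shape[1]):
    # Remove column i, keeping all the other features
    X = np.delete(X_full, i, axis=1)
    acc = cross_val_score(clf, X, y, cv=3).mean()
    # A larger positive drop means the removed feature mattered more
    drops.append(baseline - acc)
    print(f'{i}        {acc:.4f}    {acc - baseline:+.4f}')
```

A feature whose removal lowers accuracy the most is the one the classifier relies on most under this measure. Correlated features can mask each other, so a small drop does not necessarily mean a feature is uninformative.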