I have tried a KNN classifier on a data set, and I would like to know which feature contributes the most to the model, and which contributes the most to the prediction.
To gain qualitative insight into which features have the greatest impact on classification, you could perform n_feats classifications using a single feature at a time (n_feats stands for the feature vector dimension), like this:
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = KNeighborsClassifier()
y = iris.target
n_feats = iris.data.shape[1]

print('Feature Accuracy')
for i in range(n_feats):
    # Classify using only the i-th feature, reshaped into a column vector
    X = iris.data[:, i].reshape(-1, 1)
    scores = cross_val_score(clf, X, y, cv=3)
    print(f'{i} {scores.mean():g}')
Output:
Feature Accuracy
0 0.692402
1 0.518382
2 0.95384
3 0.95384
These results suggest that classification is dominated by features 2 and 3.
You could follow an alternative approach by replacing X = iris.data[:, i].reshape(-1, 1) in the code above with:
X_head = np.atleast_2d(iris.data[:, 0:i])
X_tail = np.atleast_2d(iris.data[:, i+1:])
X = np.hstack((X_head, X_tail))
In this case you are again performing n_feats classifications. The difference is that the feature vector used in the i-th classification is made up of all the features except the i-th.
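Putting it together, the complete loop for this variant (with the same setup as in the first snippet) would look like this:

print('Feature Accuracy')
for i in range(n_feats):
    # Build a feature matrix containing every column except the i-th
    X_head = np.atleast_2d(iris.data[:, 0:i])
    X_tail = np.atleast_2d(iris.data[:, i+1:])
    X = np.hstack((X_head, X_tail))
    scores = cross_val_score(clf, X, y, cv=3)
    print(f'{i} {scores.mean():g}')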
Sample run:
Feature Accuracy
0 0.973856
1 0.96732
2 0.946895
3 0.959967
These results clearly show that the classifier yields the worst accuracy when the third feature (the feature of index 2) is removed, which is consistent with the results obtained through the first approach.
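If you want a more direct importance measure rather than these ablation runs, scikit-learn also ships permutation importance, which shuffles one feature at a time and measures the resulting drop in score. A minimal sketch along those lines (the train/test split and n_repeats value are arbitrary choices of mine, not part of the approaches above):

from sklearn import datasets
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

clf = KNeighborsClassifier().fit(X_train, y_train)

# Permute each feature in turn and record the drop in test accuracy
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=0)
print('Feature Importance (mean +/- std)')
for i in range(iris.data.shape[1]):
    print(f'{i} {result.importances_mean[i]:.3f} '
          f'+/- {result.importances_std[i]:.3f}')

A larger mean importance means the score degrades more when that feature is shuffled, so given the results above you would expect features 2 and 3 to dominate here too.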