python scikit-learn text-classification

Scikit-learn: Classification


Is there a straightforward way to view the top features of each class, based on tfidf?

I am using KNeighborsClassifier, SVC with a linear kernel, and MultinomialNB.

Secondly, I have been searching for a way to view the documents that have not been classified correctly. I can view the confusion matrix, but I would like to see the specific documents to find out which features are causing the misclassification.

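from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

tfidf_vectorizer = TfidfVectorizer()  # default settings assumed
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
# Use transform (not fit_transform) so the test set reuses the training vocabulary
counts = tfidf_vectorizer.transform(test['text'].values).toarray()
predictions = classifier.predict(counts)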

EDIT: I have added the code snippet, in which I only create a tfidf vectorizer and use it to train the classifier.


Solution

  • As the previous comments suggest, a more specific question would get a better answer, but I use this package all the time, so I will try to help.

    I. How you determine the top features for each class in sklearn depends on the model you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) expose a .feature_importances_ attribute that scores each feature by its importance. In contrast, linear models (like LogisticRegression or RidgeClassifier) expose their learned weights through the .coef_ attribute. Most of them apply a regularization penalty on the size of the coefficients, so the coefficient magnitudes are a rough reflection of feature importance, although you need to keep in mind the numeric scales of the individual features.

    In summary, most sklearn models offer some way to extract feature importances, but the method differs from model to model. Luckily the sklearn documentation is FANTASTIC, so I would read up on your specific model to determine the best approach. Also, make sure to read the User Guide for your problem type in addition to the model-specific API reference.

    II. There is no out-of-the-box sklearn method that returns the misclassified records, but if you are using a pandas DataFrame (which you should) to feed the model, it can be accomplished in a few lines of code like this:

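    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    df = pd.DataFrame(data)
    x = df[[<list of feature columns>]]
    y = df[<target column>]
    
    # Fit the model on the feature matrix and the labels
    mod = RandomForestClassifier()
    mod.fit(x.values, y.values)
    
    # Store the predictions next to the true labels
    df['predict'] = mod.predict(x.values)
    
    # Keep only the rows where the prediction differs from the true label
    incorrect = df[df['predict'] != df[<target column>]]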
    

    The resultant incorrect DataFrame will contain only records which are misclassified.

    Hope this helps!