python-2.7 scikit-learn text-classification naivebayes

How to identify the ID / name / title of misclassified text files with scikit-learn

I am building my own classifier for text classification, but at the moment I am playing with scikit-learn in order to figure out a few things. I classified a few of my text files using an NB classifier. I am using 26 text files manually categorised into 2 categories, with the files numbered 01 to 26 (i.e. '01.txt' and so forth).

My code and results:

import sklearn
from sklearn.datasets import load_files
import numpy as np
bunch = load_files('corpus')

split_pcnt = 0.75 
split_size = int(len(bunch.data) * split_pcnt)
X_train = bunch.data[:split_size]
X_test = bunch.data[split_size:]
y_train = bunch.target[:split_size]
y_test = bunch.target[split_size:]

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

clf_1 = Pipeline([('vect', CountVectorizer()),
                      ('clf', MultinomialNB()),
    ])

from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross-validation iterator with K folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print ("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores))

clfs = [clf_1]

for clf in clfs:
    evaluate_cross_validation(clf, bunch.data, bunch.target, 5)

[ 0.5  0.4  0.4  0.4  0.6]
Mean score: 0.460 (+/-0.040)

from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):

    clf.fit(X_train, y_train)

    print "Accuracy on training set:"
    print clf.score(X_train, y_train)
    print "Accuracy on testing set:"
    print clf.score(X_test, y_test)
    y_pred = clf.predict(X_test)

    print "Classification Report:"
    print metrics.classification_report(y_test, y_pred)
    print "Confusion Matrix:"
    print metrics.confusion_matrix(y_test, y_pred)


train_and_evaluate(clf_1, X_train, X_test, y_train, y_test)

Accuracy on training set:
1.0

Accuracy on testing set:
0.714285714286

    Classification Report:
                 precision    recall  f1-score   support

              0       0.67      0.67      0.67         3
              1       0.75      0.75      0.75         4

    avg / total       0.71      0.71      0.71         7

    Confusion Matrix:
    [[2 1]
     [1 3]]

What I cannot figure out is how to identify the IDs of the misclassified files, so that I can see exactly which files were misclassified (e.g. '05.txt' and '23.txt'). Is this possible with scikit-learn at all?

best,

guzdeh

Solution

Yes: use the filenames attribute of the load_files result. It is ordered the same way as data and target, so the same slice or index selects the matching file names.

However, you have two model training and evaluation cycles in your example code: one using cross-validation and another using a simple train-test split.

In the train-test split:

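y_pred = clf_1.predict(X_test)  # y_pred is local to train_and_evaluate, so recompute it here
test_filenames = bunch.filenames[split_size:]
misclassified = (y_pred != y_test)  # boolean mask, True where the prediction is wrong
print test_filenames[misclassified]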
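
To see what the boolean mask does, here is a minimal sketch on made-up data. The file paths and label arrays below are hypothetical stand-ins for bunch.filenames[split_size:], y_test and y_pred; they are not taken from the question's corpus.

```python
import numpy as np

# Hypothetical stand-ins for bunch.filenames[split_size:], y_test and y_pred.
test_filenames = np.array(['corpus/a/05.txt', 'corpus/b/23.txt',
                           'corpus/a/11.txt', 'corpus/b/02.txt'])
y_test = np.array([0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0])

# Element-wise comparison yields one boolean per test document.
misclassified = (y_pred != y_test)

# Indexing with the mask keeps only the entries where it is True.
wrong_files = test_filenames[misclassified]
for name, true, pred in zip(wrong_files,
                            y_test[misclassified],
                            y_pred[misclassified]):
    print("%s: true=%d predicted=%d" % (name, true, pred))
```

Printing the true and predicted label next to each name makes it easier to tell which direction each error went.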

This answer does not assume that the text files are in alphabetical order or that all numbers are present.
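
For the cross-validation cycle, one possible approach (not part of the original answer) is to collect an out-of-fold prediction for every document and apply the same mask to the full filenames array. The sketch below assumes a recent scikit-learn where cross_val_predict and KFold live in sklearn.model_selection; older releases that still ship sklearn.cross_validation may expose them there instead, with a different KFold signature. The toy texts, labels and file names are hypothetical stand-ins for bunch.data, bunch.target and bunch.filenames.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the load_files result.
texts = ["cats purr and meow", "dogs bark loudly", "kittens meow softly",
         "puppies bark and play", "cats chase mice", "dogs chase cats",
         "a cat will purr", "a dog will bark", "meow meow purr",
         "bark bark growl"]
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
filenames = np.array(["%02d.txt" % i for i in range(1, len(texts) + 1)])

clf = Pipeline([('vect', CountVectorizer()),
                ('clf', MultinomialNB())])

# Each document is predicted by a model that did not see it during training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(clf, texts, y, cv=cv)

# Predictions come back in the original document order, so the mask
# lines up with filenames directly.
misclassified = (y_pred != y)
misclassified_files = filenames[misclassified]
print(misclassified_files)
```

With the real corpus you would pass bunch.data, bunch.target and bunch.filenames in place of the toy arrays.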