python scikit-learn classification nltk naivebayes

Store most informative features from NLTK NaiveBayesClassifier in a list


I am trying this Naive Bayes classifier in Python:

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set) * 100))
classifier.show_most_informative_features(5)

I get the following output:

[Image: console output]

The output clearly shows which words appear more often in the "important" category and which in the "spam" category, but I can't work with these values as printed. What I actually want is a list that looks like this:

[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]

I am new to Python and having a hard time figuring this out. Can anyone help? I would be very thankful.


Solution

  • You could slightly modify the source code of show_most_informative_features to suit your purpose.

    In the returned list, the first element of each sub-list is the most informative feature name and the second element is its label (more specifically, the label associated with the numerator of the likelihood ratio, i.e. the label under which that feature value is most probable).

    Helper function:

    def show_most_informative_features_in_list(classifier, n=10):
        """
        Return a nested list of the "most informative" features
        used by the classifier, each paired with its predominant label
        """
        # probability distribution for feature values given labels
        # (note: _feature_probdist and _labels are private NLTK attributes)
        cpdist = classifier._feature_probdist
        feature_list = []
        for (fname, fval) in classifier.most_informative_features(n):
            def labelprob(l):
                return cpdist[l, fname].prob(fval)
            labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()], 
                            key=labelprob)
            feature_list.append([fname, labels[-1]])
        return feature_list
    

    Testing the helper on a classifier trained on NLTK's positive/negative movie review corpus:

    show_most_informative_features_in_list(classifier, 10)
    

    produces:

    [['outstanding', 'pos'],
     ['ludicrous', 'neg'],
     ['avoids', 'pos'],
     ['astounding', 'pos'],
     ['idiotic', 'neg'],
     ['atrocious', 'neg'],
     ['offbeat', 'pos'],
     ['fascination', 'pos'],
     ['symbol', 'pos'],
     ['animators', 'pos']]
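
Once the pairs are in a plain list, ordinary Python tools apply to them. As a minimal sketch (the `group_by_label` name is hypothetical and not part of NLTK, and the sample input is just a few pairs copied from the output above), the features could be grouped under their labels like this:

```python
from collections import defaultdict


def group_by_label(feature_list):
    """Group [feature, label] pairs into {label: [feature, ...]}.

    Hypothetical helper, not part of NLTK. It only reshapes a list in
    the format returned by show_most_informative_features_in_list.
    """
    grouped = defaultdict(list)
    for fname, label in feature_list:
        grouped[label].append(fname)  # keeps the original ranking order
    return dict(grouped)


# Sample input: a few pairs copied from the output above.
pairs = [['outstanding', 'pos'], ['ludicrous', 'neg'],
         ['avoids', 'pos'], ['idiotic', 'neg']]

by_label = group_by_label(pairs)
print(by_label['pos'])  # ['outstanding', 'avoids']
print(by_label['neg'])  # ['ludicrous', 'idiotic']
```

Because the helper preserves input order, each label's list stays ranked from most to least informative.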