python pandas nltk naivebayes

Classifying text strings into multiple classes using Naive Bayes with NLTK


I'm currently using Naive Bayes to classify a bunch of texts into multiple categories. Right now I just output the top category and its posterior probability, but what I would like to do is rank the categories by posterior probability and use the 2nd and 3rd place categories as "backup" categories.

Here's an example:

import pandas
df = pandas.DataFrame({'text': pandas.Categorical(["I have wings", "Metal wings", "Feathers", "Airport"]), 'true_cat': pandas.Categorical(["bird", "plane", "bird", "plane"])})

text           true_cat
-----------------------
I have wings   bird
Metal wings    plane
Feathers       bird
Airport        plane

What I'm doing:

new_cat = classifier.classify(features(text))        # most likely category
prob_cat = classifier.prob_classify(features(text))  # probability distribution over all categories

Eventual Output:

new_cat prob_cat    text           true_cat
-------------------------------------------
bird    0.67        I have wings   bird
bird    0.6         Feathers       bird
bird    0.51        Metal wings    plane
plane   0.8         Airport        plane

I have found a couple of examples using classify_many and prob_classify_many, but since I'm new to Python I'm having trouble translating them to my problem. I haven't seen them used with pandas anywhere.

I want it to look like this:

df_new = pandas.DataFrame({'text': pandas.Categorical(["I have wings","Metal wings","Feathers","Airport"]),'true_cat': pandas.Categorical(["bird","plane","bird","plane"]), 'new_cat1': pandas.Categorical(["bird","bird","bird","plane"]), 'new_cat2': pandas.Categorical(["plane","plane","plane","bird"]), 'prob_cat1': pandas.Categorical(["0.67","0.51","0.6","0.8"]), 'prob_cat2': pandas.Categorical(["0.33","0.49","0.4","0.2"])})


new_cat1    new_cat2    prob_cat1   prob_cat2   text           true_cat
-----------------------------------------------------------------------
bird        plane       0.67        0.33        I have wings   bird
bird        plane       0.51        0.49        Metal wings    plane
bird        plane       0.6         0.4         Feathers       bird
plane       bird        0.8         0.2         Airport        plane

Any help would be appreciated.


Solution

  • I'm treating your self-answer as part of your question. Presumably you got the probability of the classification bird like this:

    prob_cat.prob("bird")
    

    Here, prob_cat is an NLTK probability distribution (ProbDist), not a single number. You can get all categories in a discrete ProbDist and their probabilities like this:

    probs = list((x, prob_cat.prob(x)) for x in prob_cat.samples())
    

    Since you already know the categories you trained with, you can use a predefined list instead of prob_cat.samples(). Finally, you can order them from the most to the least probable in the same expression:

    mycategories = ["bird", "plane"]
    probs = sorted(((x, prob_cat.prob(x)) for x in mycategories), key=lambda tup: -tup[1])
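
To get from that sorted list to the DataFrame layout in the question, something along these lines might work. It is only a sketch under a few assumptions: FakeProbDist and ranked_categories are names made up for illustration, FakeProbDist merely mimics the samples() and prob() methods of a real ProbDist, and the probabilities are copied from the question's desired output rather than produced by a trained classifier. With a real classifier you would pass classifier.prob_classify(features(text)) for each text instead.

```python
import pandas


class FakeProbDist:
    """Hypothetical stand-in for the ProbDist returned by
    classifier.prob_classify(); it only mimics samples() and prob()."""

    def __init__(self, probs):
        self._probs = probs

    def samples(self):
        return self._probs.keys()

    def prob(self, sample):
        return self._probs.get(sample, 0.0)


def ranked_categories(prob_dist, top_n=2):
    """Return one flat dict holding the top_n categories and probabilities."""
    # Same sort as above: most probable category first.
    ranked = sorted(
        ((cat, prob_dist.prob(cat)) for cat in prob_dist.samples()),
        key=lambda tup: -tup[1],
    )
    row = {}
    for rank, (cat, prob) in enumerate(ranked[:top_n], start=1):
        row["new_cat%d" % rank] = cat
        row["prob_cat%d" % rank] = prob
    return row


df = pandas.DataFrame({
    "text": ["I have wings", "Metal wings", "Feathers", "Airport"],
    "true_cat": ["bird", "plane", "bird", "plane"],
})

# Made-up posteriors taken from the question's desired output. With a real
# classifier, each entry would be classifier.prob_classify(features(text)).
fake_dists = [
    FakeProbDist({"bird": 0.67, "plane": 0.33}),
    FakeProbDist({"bird": 0.51, "plane": 0.49}),
    FakeProbDist({"bird": 0.6, "plane": 0.4}),
    FakeProbDist({"plane": 0.8, "bird": 0.2}),
]

# One row of ranked columns per text, aligned to df by index.
ranked = pandas.DataFrame(
    [ranked_categories(dist) for dist in fake_dists], index=df.index
)
df_new = pandas.concat([ranked, df], axis=1)
print(df_new)
```

Raising top_n adds new_cat3, prob_cat3 and so on, as long as the classifier was trained with at least that many categories. The probabilities stay numeric here instead of being stored as strings, which makes them easier to sort or threshold later.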