I'm currently using Naive Bayes to classify a bunch of texts into multiple categories. Right now I just output the top category and its posterior probability, but what I would like to do is rank the categories by posterior probability and use the second- and third-place categories as "backup" categories.
Here's an example:
df = pandas.DataFrame({ 'text' : pandas.Categorical(["I have wings","Metal wings","Feathers","Airport"]), 'true_cat' : pandas.Categorical(["bird","plane","bird","plane"])})
text true_cat
-----------------------
I have wings bird
Metal wings plane
Feathers bird
Airport plane
What I'm doing:
new_cat = classifier.classify(features(text))
prob_cat = classifier.prob_classify(features(text))
Eventual Output:
new_cat prob_cat text true_cat
bird 0.67 I have wings bird
bird 0.6 Feathers bird
bird 0.51 Metal wings plane
plane 0.8 Airport plane
I have found a couple of examples using classify_many and prob_classify_many, but since I'm new to Python I'm having trouble translating them to my problem. I haven't seen them used with pandas anywhere.
I want it to look like this:
df_new = pandas.DataFrame({'text': pandas.Categorical(["I have wings","Metal wings","Feathers","Airport"]),'true_cat': pandas.Categorical(["bird","plane","bird","plane"]), 'new_cat1': pandas.Categorical(["bird","bird","bird","plane"]), 'new_cat2': pandas.Categorical(["plane","plane","plane","bird"]), 'prob_cat1': pandas.Categorical(["0.67","0.51","0.6","0.8"]), 'prob_cat2': pandas.Categorical(["0.33","0.49","0.4","0.2"])})
new_cat1 new_cat2 prob_cat1 prob_cat2 text true_cat
-----------------------------------------------------------------------
bird plane 0.67 0.33 I have wings bird
bird plane 0.51 0.49 Metal wings plane
bird plane 0.6 0.4 Feathers bird
plane bird 0.8 0.2 Airport plane
Any help would be appreciated.
I'm treating your self-answer as part of your question. Presumably you got the probability of the classification bird like this:
prob_cat.prob("bird")
Here, prob_cat is an nltk probability distribution (ProbDist). You can get all categories in a discrete ProbDist and their probabilities like this:
probs = list((x, prob_cat.prob(x)) for x in prob_cat.samples())
Since you already know the categories you trained with, you can use a predefined list instead of prob_cat.samples(). Finally, you can order them from the most to the least probable in the same expression:
mycategories = ["bird", "plane"]
probs = sorted(((x, prob_cat.prob(x)) for x in mycategories), key=lambda tup: -tup[1])
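To get from the sorted list to the column layout in the question (new_cat1, prob_cat1, new_cat2, prob_cat2), something along these lines might work. This is only a sketch: StandInProbDist is a hypothetical stand-in that mimics the samples() and prob() methods of an nltk ProbDist so the example runs without a trained classifier, and ranked_columns and the hard-coded probabilities are made up for illustration.

```python
class StandInProbDist:
    """Hypothetical stand-in for an nltk ProbDist; only samples() and prob() are mimicked."""

    def __init__(self, probs):
        self._probs = dict(probs)

    def samples(self):
        return list(self._probs)

    def prob(self, sample):
        return self._probs.get(sample, 0.0)


def ranked_columns(prob_dist, top_n=2):
    """Return a dict with new_cat1, prob_cat1, new_cat2, ... for the top_n categories."""
    # Same idea as the sorted() expression above: most probable category first.
    ranked = sorted(
        ((cat, prob_dist.prob(cat)) for cat in prob_dist.samples()),
        key=lambda tup: tup[1],
        reverse=True,
    )
    row = {}
    for rank, (cat, p) in enumerate(ranked[:top_n], start=1):
        row["new_cat%d" % rank] = cat
        row["prob_cat%d" % rank] = round(p, 2)
    return row


# Illustrative inputs only; with a real classifier each distribution would
# come from classifier.prob_classify(features(text)) instead.
texts = ["I have wings", "Metal wings", "Feathers", "Airport"]
dists = [
    StandInProbDist({"bird": 0.67, "plane": 0.33}),
    StandInProbDist({"bird": 0.51, "plane": 0.49}),
    StandInProbDist({"bird": 0.6, "plane": 0.4}),
    StandInProbDist({"bird": 0.2, "plane": 0.8}),
]

# One dict per text, holding the text plus its ranked categories.
rows = [dict(ranked_columns(dist), text=text) for text, dist in zip(texts, dists)]
```

Each entry in rows is a plain dict, so passing the list to pandas.DataFrame(rows) should give one column per key; the true_cat column from the original frame can then be added alongside. Raising top_n gives a third "backup" category when the classifier has more than two labels.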