Search code examples
machine-learningnlpnltktext-classificationdocument-classification

nltk naivebayes classifier for text classification


In following code, I know that my naivebayes classifier is working correctly because it is working correctly on trainset1 but why is it not working on trainset2? I even tried it on two classifiers, one from TextBlob and other directly from nltk.

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from nltk.tokenize import word_tokenize
import nltk

trainset1 = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
         ('hello i was there and no one came', 'class2'),
         ('all negative terms like sad angry etc', 'class2')]

def nltk_naivebayes(trainset, test_sentence):
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0]))
    t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset]
    classifier = nltk.NaiveBayesClassifier.train(t)
    test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}
    return classifier.classify(test_sent_features)

def textblob_naivebayes(trainset, test_sentence):
    cl = NaiveBayesClassifier(trainset)
    blob = TextBlob(test_sentence,classifier=cl)
    return blob.classify() 

test_sentence1 = "he is my horrible enemy"
test_sentence2 = "inflation soaring limps to anniversary"

print nltk_naivebayes(trainset1, test_sentence1)
print nltk_naivebayes(trainset2, test_sentence2)
print textblob_naivebayes(trainset1, test_sentence1)
print textblob_naivebayes(trainset2, test_sentence2)

Output:

neg
class2
neg
class2

Although test_sentence2 clearly belongs to class1.


Solution

  • I will assume your understand that you cannot expect a classifier to learn a good model with only 3 examples, and that your question is more to understand why it does that in this specific example.

    The likely reason it does that is that naive bayes classifier uses a prior class probability. That is, the probability of neg vs pos, regardless of the text. In your case, 2/3 of the examples are negative, thus the prior is 66% for neg and 33% for pos. The positive words in your single positive instance are 'anniversary' and 'soaring', which are unlikely to be enough to compensate this prior class probability.

    In particular, be aware that the calculation of word probabilities involve various 'smoothing' functions (for instance, it will be log10(Term Frequency + 1) in each class, not log10(Term Frequency) to prevent low frequency words to impact too much the classification results, divisions by zero, etc. Thus the probabilities for "anniversary" and "soaring" are not 0.0 for neg and 1.0 for pos, unlike what you may have expected.