Tags: python, classification, nltk, text-classification

Trouble with NLTK's NaiveBayesClassifier in Python: I keep getting the same probabilities. Are my inputs correct?


So I'm working on a project for class ("homework", if you will). It takes in anime names and genres labeled as relevant or irrelevant. I am trying to build a NaiveBayesClassifier from that data, so that I can then pass in a set of genres and have it tell me whether the anime is relevant or irrelevant. I currently have the following:

import nltk

trainingdata = [({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant'), ({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')]
classifier = nltk.classify.naivebayes.NaiveBayesClassifier.train(trainingdata)

# feature dict for the anime whose genres I want to classify
anime = {'Fantasy': True, 'Comedy': True, 'Supernatural': True}
print classifier.classify(anime)

prob_dist = classifier.prob_classify(anime)
print "relevant " + str(prob_dist.prob("relevant"))
print "unrelevant " + str(prob_dist.prob("unrelevant"))

I currently have:

size of training array: 110
number of relevant examples: 57
number of unrelevant examples: 53

Some results I receive :

relevant Tantei Opera Milky Holmes TD
input data passed to classify: {'Mystery': True, 'Comedy': True, 'Super': True, 'Power': True}
relevant 0.518018018018
unrelevant 0.481981981982

relevant Juuou Mujin no Fafnir
input data passed to classify: {'Romance': True, 'Fantasy': True, 'School': True}
relevant 0.518018018018
unrelevant 0.481981981982

So it looks like it's not reading my data correctly, since 57/110 ≈ 0.518, but I'm not sure what I am doing wrong...

I looked at this question: nltk NaiveBayesClassifier training for sentiment analysis

and I feel like I am doing it correctly. The only thing I am not doing is explicitly adding entries for the keys that don't appear in a given feature dict. Does that matter?

Thanks!


Solution

  • Some background: the OP's purpose is to build a classifier for this project: https://github.com/alejandrovega44/CSCE-470-Anime-Recommender

    Firstly, there are several methodological issues in terms of what you're calling things.

    Your training data should be the raw data you're using for your task, i.e. the JSON file at: https://raw.githubusercontent.com/alejandrovega44/CSCE-470-Anime-Recommender/naive2/py/UserAnime2

    And the data structures you have in your question should be called feature vectors, i.e.:

    ({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant')
    ({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')
    
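    (As an aside, here is a minimal sketch of how such a (features, label) pair could be built from a raw genre list; the make_feature_vector name and the lowercasing are illustrative assumptions, not code from the project:)

    def make_feature_vector(genres, label):
        # lowercase each genre name so training and test features share the same keys
        return ({genre.lower(): True for genre in genres}, label)

    print make_feature_vector(['Drama', 'Mystery', 'Horror', 'Psychological'], 'relevant')
    # -> a (feature dict, label) pair like the ones shown above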

    The features in the training set in your sample code are:

    'drama'
    'mystery'
    'horror'
    'psychological'
    'fantasy'
    'romance'
    'adventure'
    'science fiction'
    

    But the features in your test set in your sample code are:

    'Fantasy'
    'Comedy'
    'Supernatural'
    'Mystery'
    'Comedy'
    'Super'
    'Power'
    'Romance'
    'Fantasy'
    'School'
    

    Because strings are case-sensitive, none of the features in your test data occurs in your training data. NLTK's NaiveBayesClassifier silently drops any feature name it never saw during training, so all it can return is the label prior. With your real 57/53 split, that smoothed prior is 57.5/111 ≈ 0.518018 vs. 53.5/111 ≈ 0.481982 (the default ELE estimator adds 0.5 to each label count), which is exactly what you're seeing. For a binary toy example with one training document per class, the prior is 50%-50%, i.e.:

    import nltk
    feature_vectors =[
    ({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant'), 
    ({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')]
    classifier = nltk.classify.naivebayes.NaiveBayesClassifier.train(feature_vectors)
    prob_dist = classifier.prob_classify({'Fantasy': True, 'Comedy': True, 'Supernatural': True})
    print "relevant " + str(prob_dist.prob("relevant"))
    print "unrelevant " + str(prob_dist.prob("unrelevant"))
    

    [out]:

    relevant 0.5
    unrelevant 0.5
    
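    (One way to see the mismatch directly is to ask the trained classifier which feature names it actually learned; a quick sketch, reusing the classifier trained just above:)

    # prints the feature names the model actually learned (drama, mystery, ...) with their label ratios
    classifier.show_most_informative_features()

    # or, as data: a list of (feature name, feature value) pairs
    print classifier.most_informative_features()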

    Even if you give the classifier the same documents but with capitalized feature names, it won't recognize them, e.g.:

    import nltk
    feature_vectors =[
    ({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant'), 
    ({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')]
    classifier = nltk.classify.naivebayes.NaiveBayesClassifier.train(feature_vectors)
    
    doc1 = {'drama': True, 'mystery': True, 'horror': True, 'psychological': True}
    prob_dist = classifier.prob_classify(doc1)
    print "relevant " + str(prob_dist.prob("relevant"))
    print "unrelevant " + str(prob_dist.prob("unrelevant"))
    print '----'
    caps_doc1 = {'Drama': True, 'Mystery': True, 'Horror': True, 'Psychological': True}
    prob_dist = classifier.prob_classify(caps_doc1)
    print "relevant " + str(prob_dist.prob("relevant"))
    print "unrelevant " + str(prob_dist.prob("unrelevant"))
    print '----'
    

    [out]:

    relevant 0.964285714286
    unrelevant 0.0357142857143
    ----
    relevant 0.5
    unrelevant 0.5
    ----
    
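    (If the goal is just to make the two sides match, one workaround is to normalize the feature names, e.g. lowercase them, on both the training and the test data before they reach the classifier. A sketch reusing feature_vectors and caps_doc1 from the snippets above, not code from the OP's project:)

    def normalize(featureset):
        # lowercase every feature name so 'Drama' and 'drama' become the same key
        return dict((fname.lower(), fval) for fname, fval in featureset.items())

    # feature_vectors as defined in the snippets above
    classifier = nltk.classify.naivebayes.NaiveBayesClassifier.train(
        [(normalize(fv), label) for (fv, label) in feature_vectors])

    prob_dist = classifier.prob_classify(normalize(caps_doc1))
    # should now give the same ~0.96 / ~0.04 split as the lowercase doc1 above
    print "relevant " + str(prob_dist.prob("relevant"))
    print "unrelevant " + str(prob_dist.prob("unrelevant"))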

    Without more details and better sample code to debug, this is all the help we can give on this question. =(