Tags: python-3.x, machine-learning, scikit-learn, text-classification, naivebayes

TypeError from MultinomialNB: float() argument must be a string or a number


I am trying to compare the performance of the Multinomial, Binomial and Bernoulli classifiers, but I am getting an error:

TypeError: float() argument must be a string or a number, not 'set'

The code below runs up to the MultinomialNB step.

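import random

import nltk
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]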

random.shuffle(documents)

#print(documents[1])

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def look_for_features(document):
    words = set(document)
    features = {}
    for x in word_features:
        features[x] = {x in words}
    return features

#feature set will be finding features and category
featuresets = [(look_for_features(rev), category) for (rev, category) in documents]

training_set = featuresets[:1400]
testing_set = featuresets[1400:]

#Multinomial
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print ("Accuracy: ", (nltk.classify.accuracy(MNB_classifier,testing_set))*100)

The error seems to be raised by MNB_classifier.train(training_set). It looks similar to the error here.


Solution

  • Change...

    features[x] = {x in words}
    

    to...

    features[x] = x in words
    

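    In the first version, the braces around x in words are a set literal, so every feature value in the dicts inside featuresets is a one-element set, {True} or {False}, instead of a plain boolean. SklearnClassifier converts feature values to floats when it vectorizes the feature dicts, and a set cannot be converted, which is what raises the TypeError above.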


    The code looks very much like the one from "Creating a module for Sentiment Analysis with NLTK". The author writes (x in words) there, but those parentheses only group the expression (they do not create a tuple), so it is no different from just x in words.
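
The difference between the two versions can be checked without NLTK or scikit-learn. The following is a minimal sketch in plain Python; the three-word word_features list, the sample document and the name look_for_features_buggy are made up for illustration and are not from the original post.

```python
# Hypothetical stand-ins for the question's data (not from the original post).
word_features = ["good", "bad", "plot"]
document = ["a", "good", "plot"]


def look_for_features_buggy(document):
    words = set(document)
    # Braces around the test build a one-element set: {True} or {False}.
    return {x: {x in words} for x in word_features}


def look_for_features(document):
    words = set(document)
    # Plain membership test: the value is the bool True or False.
    return {x: x in words for x in word_features}


buggy = look_for_features_buggy(document)
fixed = look_for_features(document)

print(type(buggy["good"]))  # <class 'set'>
print(type(fixed["good"]))  # <class 'bool'>

# float() accepts a bool but not a set; this is the conversion that fails.
print(float(fixed["good"]))  # 1.0
try:
    float(buggy["good"])
except TypeError as exc:
    print("TypeError:", exc)

# Parentheses alone only group; a trailing comma is what makes a tuple.
print(type(("good" in document)))   # <class 'bool'>
print(type(("good" in document,)))  # <class 'tuple'>
```

With the fixed function, every feature value is a bool, which converts to 1.0 or 0.0, so the feature dicts can be turned into a numeric matrix.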