I am having an issue training my Naive Bayes classifier. I have a feature set and targets that I want to use, but I keep getting errors. I've looked at other people with similar problems, but I can't seem to figure out the issue. I'm sure there's a simple solution, but I've yet to find it.
Here's an example of the structure of the data that I'm trying to use to train the classifier.
In [1]: train[0]
Out[1]: ({
u'profici': [False],
u'saver': [False],
u'four': [True],
u'protest': [False],
u'asian': [True],
u'upsid': [False],
...
u'captain': [False],
u'payoff': [False],
u'whose': [False]
},
0)
Where train[0] is the first tuple in a list and contains:
A dictionary of features and boolean values to indicate the presence or absence of words in document[0]
The target label for the binary classification of document[0]
Obviously, the rest of the train list has the features and labels for the other documents that I want to classify.
When running the following code:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
MNB_clf = SklearnClassifier(MultinomialNB())
MNB_clf.train(train)
I get the error message:
TypeError: float() argument must be a string or a number
Edit:
The features are created as follows, from a dataframe post_sent that contains the posts in one column and the sentiment classification in another.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk import FreqDist
import itertools

stopwords = set(stopwords.words('english'))
filtered_posts = []
punc_tokenizer = RegexpTokenizer(r'\w+')

# tokenizing and removing stopwords
for post in post_sent.post:
    tokenized = [word.lower() for word in punc_tokenizer.tokenize(post)]
    filtered = [w for w in tokenized if w not in stopwords]
    filtered_posts.append(filtered)

# stemming
stemmer = PorterStemmer()
tokened_stemmed = []
for post in filtered_posts:
    stemmed = [stemmer.stem(w) for w in post]
    tokened_stemmed.append(stemmed)

# frequency dist
all_words = list(itertools.chain.from_iterable(tokened_stemmed))
frequency = FreqDist(all_words)

# feature selection
word_features = list(frequency.keys())[:3000]
# IMPORTANT PART
#######################
# ------ featuresets creation ---------
def find_features(post):
    features = {}
    wrds = set(post)
    for w in word_features:
        features[w] = [w in wrds]
    return features
# zipping inputs with targets
words_and_sent = zip(tokened_stemmed, post_sent.sentiment)

# IMPORTANT PART
##########################
# feature sets created here
featuresets = [(find_features(words), sentiment)
               for words, sentiment in words_and_sent]
You are constructing train incorrectly. As @lenz said in a comment, remove the brackets around the feature dict values and use plain single values.
As given in the official documentation:
labeled_featuresets – A list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings.
But you are setting each mapping (the value for each key in the dict) to a one-element list.
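This is also why the error mentions float(): the classifier hands the feature dicts to scikit-learn, which tries to coerce each value to a number, and float([False]) fails with that exact TypeError. The quickest fix is at the source, in find_features: store a plain boolean instead of a one-element list. A minimal sketch (word_features is passed in as a parameter here to keep the snippet self-contained; the question's version reads it from the enclosing scope):

```python
def find_features(post, word_features):
    # `post` is an iterable of stemmed tokens; `word_features` is the
    # list of candidate words (3000 in the question).
    wrds = set(post)
    # plain bool value, not a one-element list like [w in wrds]
    return {w: (w in wrds) for w in word_features}
```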
Your correct train should look like:
[({u'profici': False,
   u'saver': False,
   u'four': True,
   u'protest': False,
   u'asian': True,
   u'upsid': False,
   ...
  }, 0),
 ...
 ({u'profici': True,
   u'saver': False,
   u'four': False,
   u'protest': False,
   u'asian': True,
   u'upsid': False,
   ...
  }, 1)]
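If you have already built the featuresets, you can also unwrap the one-element lists after the fact instead of regenerating them. A minimal sketch, assuming train has the structure shown in the question (unwrap_features is a hypothetical helper name):

```python
def unwrap_features(train):
    # turn each {word: [bool]} mapping into {word: bool},
    # keeping the (featureset, label) tuple structure intact
    return [({w: vals[0] for w, vals in feats.items()}, label)
            for feats, label in train]

train = [({u'four': [True], u'saver': [False]}, 0),
         ({u'four': [False], u'saver': [True]}, 1)]
fixed = unwrap_features(train)
# fixed[0] is ({u'four': True, u'saver': False}, 0)
```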
You can take a look at more examples here: http://www.nltk.org/howto/classify.html