I am trying to use Naive Bayes to detect humor in texts. I have this code, taken from here, but I get an error that I don't know how to resolve because I am pretty new to machine learning and these algorithms. My training data contains one-liners. I know others have asked the same question, but I haven't found an answer yet.
import os
import io
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(filepath, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    # the message body starts after the first blank line
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield filepath, message

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)

data = DataFrame({'message': [], 'class': []})
# DataFrame.append was removed in pandas 2.0; concat does the same job
data = concat([data, dataFrameFromDirectory('G:/PyCharmProjects/naive_bayes_classifier/train_jokes', 'funny')])
data = concat([data, dataFrameFromDirectory('G:/PyCharmProjects/naive_bayes_classifier/train_non_jokes', 'notfunny')])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

examples = ['Where do steers go to dance? The Meat Ball', 'tomorrow I press this button']
examples_counts = vectorizer.transform(examples)
predictions = classifier.predict(examples_counts)
print(predictions)
And the error:

Traceback (most recent call last):
  File "G:/PyCharmProjects/naive_bayes_classifier/NaiveBayesClassifier.py", line 55, in <module>
    counts = vectorizer.fit_transform(data['message'].values)
  File "C:\Users\mr_wizard\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\mr_wizard\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 811, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
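This error means CountVectorizer saw no usable tokens in any document. A minimal sketch of how it arises: if every string passed to fit_transform is empty, there is nothing to build a vocabulary from.

```python
from sklearn.feature_extraction.text import CountVectorizer

# If every document is empty (or contains only stop words),
# CountVectorizer cannot build a vocabulary and raises ValueError.
docs = ['', '', '']
try:
    CountVectorizer().fit_transform(docs)
except ValueError as e:
    print(e)  # empty vocabulary; perhaps the documents only contain stop words
```

So the question becomes: why is every message in `data['message']` empty?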
Here are some sample inputs from train_jokes:
"[me narrating a documentary about narrators] ""I can't hear what they're saying cuz I'm talking"""
"Telling my daughter garlic is good for you. Good immune system and keeps pests away.Ticks, mosquitos, vampires... men."
I've been going through a really rough period at work this week It's my own fault for swapping my tampax for sand paper.
"If I could have dinner with anyone, dead or alive... ...I would choose alive. -B.J. Novak-"
Two guys walk into a bar. The third guy ducks.
Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo
Why was the musician arrested? He got in treble.
Did you hear about the guy who blew his entire lottery winnings on a limousine? He had nothing left to chauffeur it.
What do you do if a bird shits on your car? Don't ask her out again.
He was a real gentlemen and always opened the fridge door for me
train_jokes contains about 250,000 one-liners or tweets, and train_non_jokes contains simple sentences that are not funny. At the moment I don't have the non-funny file ready, just some sentences from Twitter.
The problem was not with the code, but with the training data. First of all, G:/PyCharmProjects/naive_bayes_classifier/train_jokes and G:/PyCharmProjects/naive_bayes_classifier/train_non_jokes must be paths to directories that contain the training files (so train_jokes and train_non_jokes are directories, not files). Secondly, my files contained no blank line, so the variable inBody stayed False and every message came out empty. For the program to work, each training file needed to look like this:

text here and then blank line

another text

and this is it

(I simply removed every reference to inBody, which solved the blank-line problem.) These are details I missed while watching that video, because he didn't mention them. Thank you all for your answers, they helped a lot.
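For reference, removing the inBody logic can be sketched like this: a minimal readFiles that simply yields each file's full contents, so no blank line is required in the data.

```python
import os
import io

def readFiles(path):
    # Walk the directory tree and yield (filepath, full file contents).
    # The blank-line / inBody logic is removed entirely, so files that
    # start with text on the first line are read correctly.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            with io.open(filepath, 'r', encoding='latin1') as f:
                yield filepath, f.read()
```

The rest of the script (dataFrameFromDirectory, the vectorizer, and the classifier) works unchanged with this version.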