Tags: python, machine-learning, scikit-learn, nlp

Text classification: TF-IDF and Naive Bayes?


I'm attempting a text classification task, where I have training data of around 500 restaurant reviews labelled across 12 categories. I spent longer than I should have implementing TF-IDF and cosine similarity for classifying the test data, only to get very poor results (0.4 F-measure). With time not on my side now, I need something significantly more effective that doesn't have a steep learning curve. I am considering using the TF-IDF values in conjunction with Naive Bayes. Does this sound sensible? I know that if I can get my data into the right format, I can do this with scikit-learn. Is there anything else you recommend I consider?


Solution

  • You should try fastText: https://pypi.python.org/pypi/fasttext . It can be used to classify text like this:

    (Don't forget to download a pretrained model first, e.g. https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip; change the language code in the URL if your reviews aren't in English.)

    import fasttext
    
    # Load the pretrained vectors downloaded above. Note that this model is
    # optional here: fasttext.supervised() below trains the classifier from
    # scratch on your labelled data and does not use it.
    model = fasttext.load_model('wiki.en.bin')
    
    # Train a supervised classifier; 'model' is the output file prefix
    classifier = fasttext.supervised('train.txt', 'model', label_prefix='__label__')
    
    # Evaluate on a held-out test file in the same format as train.txt
    result = classifier.test('test.txt')
    print('P@1:', result.precision)
    print('R@1:', result.recall)
    print('Number of examples:', result.nexamples)
    

    Every line in your training and test files should look like this:

    __label__classname Your restaurant review blah blah blah
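    As a small illustration, labelled reviews can be written to that format with a few lines of Python (the `reviews` list below is hypothetical example data; your own labels and texts go in its place):

    ```python
    # Sketch: write labelled reviews to fastText's expected file format.
    # Each line is "__label__<class> <review text>".
    reviews = [
        ("service", "The waiter was friendly and attentive."),
        ("food", "The pasta was overcooked and bland."),
    ]

    with open("train.txt", "w", encoding="utf-8") as f:
        for category, text in reviews:
            f.write(f"__label__{category} {text}\n")
    ```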
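    Alternatively, the TF-IDF plus Naive Bayes combination mentioned in the question is also quick to set up in scikit-learn; a minimal sketch, with hypothetical toy reviews standing in for your labelled data:

    ```python
    # Sketch of a TF-IDF + multinomial Naive Bayes text classifier in
    # scikit-learn. The training texts and labels below are made-up examples.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "great service and friendly staff",
        "the pizza was cold and tasteless",
        "lovely ambience with candle-lit tables",
    ]
    train_labels = ["service", "food", "ambience"]

    # TF-IDF features feed directly into the Naive Bayes model
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(train_texts, train_labels)

    print(clf.predict(["the staff were very friendly"]))  # → ['service']
    ```

    With real data you would fit on all ~500 reviews and score held-out reviews with `clf.predict`; this baseline needs no pretrained model and trains in well under a second at that scale.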