Tags: python, machine-learning, scikit-learn, nlp

Text classification: TF-IDF and Naive Bayes?


I'm attempting a text classification task, where I have training data of around 500 restaurant reviews labelled across 12 categories. I spent longer than I should have implementing TF-IDF and cosine similarity for classifying the test data, only to get very poor results (0.4 F-measure). With time not on my side now, I need something significantly more effective that doesn't have a steep learning curve. I am considering using the TF-IDF values in conjunction with Naive Bayes. Does this sound sensible? I know that if I can get my data into the right format, I can do this with scikit-learn. Is there anything else you recommend I consider?


Solution

  • You should try fastText: https://pypi.python.org/pypi/fasttext . It can be used to classify text like this:

    (Don't forget to download a pretrained model first, e.g. https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip; change the language code in the URL if your reviews aren't in English.)

    import fasttext
    
    # Load the pretrained vectors downloaded above. Note that this model is
    # optional here: fasttext.supervised() below trains the classifier from
    # scratch on your labelled data and does not use it.
    model = fasttext.load_model('wiki.en.bin')
    
    # Train a supervised classifier; 'model' is the output file prefix
    classifier = fasttext.supervised('train.txt', 'model', label_prefix='__label__')
    
    # Evaluate on a held-out test file in the same format as train.txt
    result = classifier.test('test.txt')
    print('P@1:', result.precision)
    print('R@1:', result.recall)
    print('Number of examples:', result.nexamples)
    

    Every line in your training and test files should look like this:

    __label__classname Your restaurant review blah blah blah
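    As a small illustration, labelled reviews can be written to that format with a few lines of Python (the `reviews` list below is hypothetical example data; your own labels and texts go in its place):

    ```python
    # Sketch: write labelled reviews to fastText's expected file format.
    # Each line is "__label__<class> <review text>".
    reviews = [
        ("service", "The waiter was friendly and attentive."),
        ("food", "The pasta was overcooked and bland."),
    ]

    with open("train.txt", "w", encoding="utf-8") as f:
        for category, text in reviews:
            f.write(f"__label__{category} {text}\n")
    ```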
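    Alternatively, the TF-IDF plus Naive Bayes combination mentioned in the question is also quick to set up in scikit-learn; a minimal sketch, with hypothetical toy reviews standing in for your labelled data:

    ```python
    # Sketch of a TF-IDF + multinomial Naive Bayes text classifier in
    # scikit-learn. The training texts and labels below are made-up examples.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "great service and friendly staff",
        "the pizza was cold and tasteless",
        "lovely ambience with candle-lit tables",
    ]
    train_labels = ["service", "food", "ambience"]

    # TF-IDF features feed directly into the Naive Bayes model
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(train_texts, train_labels)

    print(clf.predict(["the staff were very friendly"]))  # → ['service']
    ```

    With real data you would fit on all ~500 reviews and score held-out reviews with `clf.predict`; this baseline needs no pretrained model and trains in well under a second at that scale.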