python, nltk, document-classification

Document classification using Naive Bayes in Python


I'm working on a document classification project using a Naive Bayes classifier in Python, with the NLTK module. The documents come from the Reuters dataset. I preprocessed them with stemming and stopword elimination, then computed the tf-idf of the index terms. I used these values to train the classifier, but the accuracy is very poor (53%). What should I do to improve it?


Solution

  • A few points that might help:

    • Don't use a stoplist; it lowers accuracy. Do remove punctuation, though.
    • Look at the word features and keep only the top ones, for example the top 1000. Reducing dimensionality will improve your accuracy a lot.
    • Use bigrams as well as unigrams; this will raise the accuracy a bit.
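
The three points above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the toy `DOCS` list stands in for the Reuters documents, the helper names (`tokenize`, `ngrams`, `top_features`, `featurize`) are hypothetical, and "top" is assumed to mean "most frequent", which the answer does not specify. It uses only the standard library and produces feature dicts in the shape NLTK classifiers accept.

```python
import string
from collections import Counter

# Toy stand-in for the Reuters documents (hypothetical data, for illustration only).
DOCS = [
    ("oil prices rose as crude oil supply fell", "crude"),
    ("crude oil exports fell, and oil prices rose again", "crude"),
    ("wheat harvest rose; grain exports rose too", "grain"),
    ("grain prices fell after the wheat harvest", "grain"),
]

_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace; stopwords are kept."""
    return text.lower().translate(_PUNCT).split()


def ngrams(tokens):
    """Return the unigrams plus adjacent-word bigrams (joined with a space)."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def top_features(docs, n=1000):
    """Pick the n most frequent unigrams/bigrams across the corpus.

    Assumption: frequency is the ranking criterion; other criteria are possible.
    """
    counts = Counter()
    for text, _label in docs:
        counts.update(ngrams(tokenize(text)))
    return [term for term, _count in counts.most_common(n)]


def featurize(text, vocab):
    """Boolean presence features as a dict, one entry per vocabulary term."""
    present = set(ngrams(tokenize(text)))
    return {f"contains({term})": term in present for term in vocab}


# n is tiny here only because the toy corpus is tiny.
vocab = top_features(DOCS, n=20)
train_set = [(featurize(text, vocab), label) for text, label in DOCS]
```

With NLTK installed, `train_set` can be passed to `nltk.NaiveBayesClassifier.train(train_set)`, and new documents classified with `classifier.classify(featurize(text, vocab))`.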

    You may also find that alternative weighting techniques, such as log(1 + TF) * log(IDF), improve accuracy. Good luck!
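
As a rough sketch of that weighting, assuming "IDF" in the formula means the raw ratio N / DF (N documents in total, DF of them containing the term), so that the logarithm is applied only once. The function name `sublinear_tfidf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def sublinear_tfidf(docs):
    """Weight each term in each document as log(1 + TF) * log(N / DF).

    Assumption: IDF is taken as the raw ratio N / DF, not an already-logged value.
    `docs` is a list of token lists; the result is one {term: weight} dict per doc.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        weighted.append({
            term: math.log(1 + count) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weighted


docs = [
    ["oil", "prices", "oil"],
    ["wheat", "prices"],
]
weights = sublinear_tfidf(docs)
```

A term that occurs in every document (here `prices`) gets weight 0, since log(N / DF) = log(1). Note that NLTK's `NaiveBayesClassifier` treats feature values as discrete labels, so real-valued weights like these would need to be binned before training with it.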