document classification using naive bayes in python

I'm doing a project on document classification using naive bayes classifier in python. I have used the nltk python module for the same. The docs are from reuters dataset. I performed preprocessing steps such as stemming and stopword elimination and proceeded to compute tf-idf of the index terms. i used these values to train the classifier but the accuracy is very poor(53%). What should I do to improve the accuracy?

Solution

A few points that might help:

Don't use a stoplist, it lowers accuracy (but do remove punctuation)
Look at word features, and take only the top 1000 for example. Reducing dimensionality will improve your accuracy a lot;
Use bigrams as well as unigrams - this will up the accuracy a bit.

You may also find alternative weighting techniques such as log(1 + TF) * log(IDF) will improve accuracy. Good luck!