Tags: python, machine-learning, bayesian, scikits, scikit-learn

Naive Bayes classifier using Python


I'm using scikit-learn to find the Tf-idf weights of a document and then the Naive Bayes classifier to classify the text. But the Tf-idf weights of all words in the document are negative except a few. As far as I know, negative values mean unimportant terms. So is it necessary to pass all the Tf-idf values to the Bayes classifier? If we only need to pass a few of them, how can we do that? Also, how much better or worse is a Naive Bayes classifier compared to LinearSVC? Is there a better way to find tags in a text than using Tf-idf?

Thanks


Solution

  • You have a lot of questions there, but I'll try to help.

    As far as I remember, TF-IDF should not be a negative value. TF is the term frequency (how often a term appears in a particular document), and IDF is the inverse document frequency (# of documents in the corpus / # of documents that include the term), which is then usually log-weighted. We often add one to the denominator as well to avoid division by zero. Hence, the only time you would get a negative tf*idf is when the term appears in every single document of the corpus (which, as you mentioned, is not very helpful to search on since it adds no information). I would double-check your algorithm.

    given term t, document d, corpus c:

    tfidf = term freq * log(document count / (document frequency + 1))
    tfidf = [# of t in d] * log([#d in c] / ([#d with t in c] + 1))
    
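    To make that concrete, here is a quick hand-rolled sanity check of the formula above (the function and the toy corpus are my own illustration, not scikit-learn's exact variant; for what it's worth, recent versions of scikit-learn's TfidfVectorizer add one to the idf, so its output never goes negative):

    import math

    def tfidf(term, doc, corpus):
        tf = doc.count(term)                      # of t in d
        df = sum(1 for d in corpus if term in d)  # of d with t in c
        return tf * math.log(len(corpus) / (df + 1))

    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"],
        ["the", "dog", "sat"],
    ]
    print(tfidf("cat", corpus[0], corpus))  # > 0: "cat" is in 2 of 4 docs
    print(tfidf("the", corpus[0], corpus))  # < 0: "the" is in every doc

    As expected, the only negative score comes from the term that appears in every document.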

    In machine learning, Naive Bayes and SVMs are both good tools--their quality will vary depending on the application, and I've done projects where their accuracy turned out to be comparable. Naive Bayes is usually pretty easy to hack together by hand--I'd give that a shot first before venturing into SVM libraries.
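    If you'd rather not hand-roll it, both classifiers are a few lines in scikit-learn. A minimal sketch (the texts and labels below are toy placeholders for your own data; note that MultinomialNB needs non-negative features, which standard tf-idf is):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    texts = ["buy cheap pills now", "meeting at noon",
             "spam spam buy now", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=0)

    vec = TfidfVectorizer()
    X_train_vec = vec.fit_transform(X_train)  # learn vocabulary and idf on training data
    X_test_vec = vec.transform(X_test)        # reuse that vocabulary on the test data

    for clf in (MultinomialNB(), LinearSVC()):
        clf.fit(X_train_vec, y_train)
        print(type(clf).__name__, clf.score(X_test_vec, y_test))

    Running both on the same tf-idf features like this is an easy way to see which one suits your data before committing to either.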

    I might be missing something, but I'm not quite confident I know exactly what you're looking for--happy to modify my answer.