java machine-learning classification document-classification categorization

text categorization classifiers


Does anybody know of good open-source text-categorization models? I know about Stanford Classifier, Weka, MALLET, etc., but all of them require training.

I need to classify news articles into Sports/Politics/Health/Gaming/etc. Are there any pre-trained models out there?

Alchemy, OpenCalais, etc. are not options. I need open-source tools (preferably in Java).


Solution

  • Using a pre-trained model assumes that the corpus it was trained on comes from the same domain as the documents you are trying to classify. Generally this is not going to give you the results you want, because you don't have the original corpus. Machine learning is not static: after you train a classifier, you need to update the model when new features or information become available.

    Take your example of classifying news articles into Sports/Politics/Health/Gaming/etc.

    First, what language? Are we talking about English only? How was the original corpus labeled? And the biggest unknown is the "etc." category.

    Training your own classifier is really easy. If you are classifying text, MALLET is the best choice. You can be up and running in less than 10 minutes, and you can add MALLET to your own application in under an hour.

    If you want to classify news articles, there are a lot of open-source corpora that you can use as a base to start training. I would start with Reuters-21578 or RCV1.
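
To give a feel for how little is involved in training your own classifier, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Java. It is not MALLET's API, just a dependency-free stand-in for the same idea. The class name `NaiveBayesSketch`, its `train` and `classify` methods, and the toy training sentences are all made up for illustration; in practice you would feed it labeled documents from a real corpus such as Reuters-21578.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration: multinomial Naive Bayes with add-one smoothing. */
public class NaiveBayesSketch {
    // label -> (token -> number of times the token occurred under this label)
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    // label -> total number of tokens seen under this label
    private final Map<String, Integer> totalTokens = new HashMap<>();
    // label -> number of training documents with this label
    private final Map<String, Integer> docCounts = new HashMap<>();
    // every distinct token seen during training
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercases the text and splits on anything that is not a letter or digit. */
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{N}]+");
    }

    /** Adds one labeled document; can be called again later to update the model. */
    public void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split() can yield a leading empty string
            counts.merge(token, 1, Integer::sum);
            totalTokens.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-probability for the given text. */
    public String classify(String text) {
        if (totalDocs == 0) {
            throw new IllegalStateException("classifier has not been trained");
        }
        String[] tokens = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior: fraction of training documents carrying this label
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int total = totalTokens.getOrDefault(label, 0);
            for (String token : tokens) {
                // tokens never seen in training carry no evidence, so skip them
                if (!vocabulary.contains(token)) continue;
                int count = counts.getOrDefault(token, 0);
                // add-one (Laplace) smoothing avoids zero probabilities
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Sports", "The team won the match after a late goal");
        nb.train("Sports", "The striker scored twice in the final");
        nb.train("Politics", "The senate passed the budget bill");
        nb.train("Politics", "The president vetoed the election law");
        System.out.println(nb.classify("the team scored a goal")); // prints "Sports"
    }
}
```

Because `train` only increments counts, new labeled articles can be added at any time without retraining from scratch, which is the kind of ongoing update a fixed pre-trained model does not allow. A library such as MALLET adds the parts this sketch leaves out, for example better tokenization, feature selection, and evaluation.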