algorithm machine-learning data-mining document-classification

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOCs) that I want to classify, so I can search over them by tag. The tags could either be my own (I assign them to each document by hand) or be extracted from the text.

I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something simpler.

Solution

Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.

http://opennlp.sourceforge.net/api/index.html

Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of documents for each category you want. If the categories are really distinct, you can get away with a small sample size.

You can use the DocumentCategorizerME.train() static function to train a model on a collection of documents, where each document needs a category tag and the text block to train on. Then you can initialize a DocumentCategorizerME with the trained model and begin classifying the rest of your documents.
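To make the train-then-classify workflow concrete, here is a minimal sketch in plain Java. It is not OpenNLP's API: it swaps in a tiny multinomial Naive Bayes classifier with add-one smoothing as a stand-in, and every name in it (TagClassifierSketch, train, classify, the sample categories and texts) is hypothetical. It only shows the shape of the process: feed in a few hand-tagged documents, then ask for the best category for new text.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-in categorizer: multinomial Naive Bayes with add-one smoothing.
// NOT OpenNLP's API; all names here are hypothetical.
class TagClassifierSketch {
    // category -> (word -> occurrences of that word in the category's training text)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // category -> total number of words seen for that category
    private final Map<String, Integer> totalWords = new HashMap<>();
    // category -> number of training documents
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase the text and split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    // Add one manually tagged document to the training data.
    void train(String category, String text) {
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    // Return the category with the highest log-probability for the text.
    String classify(String text) {
        if (totalDocs == 0) {
            throw new IllegalStateException("train() must be called first");
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : wordCounts.keySet()) {
            // Prior: fraction of training documents carrying this category.
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                // Add-one smoothing so unseen words do not zero out the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TagClassifierSketch classifier = new TagClassifierSketch();
        classifier.train("finance", "invoice payment due amount tax bank transfer");
        classifier.train("finance", "quarterly budget revenue and tax report");
        classifier.train("legal", "contract clause liability agreement court");
        classifier.train("legal", "the parties agree to the terms of this agreement");

        System.out.println(classifier.classify("please pay the invoice amount by bank transfer")); // prints finance
        System.out.println(classifier.classify("liability under the contract")); // prints legal
    }
}
```

With OpenNLP the same two steps map onto DocumentCategorizerME.train() and a DocumentCategorizerME instance built from the resulting model; the exact signatures depend on the OpenNLP version, so check the doccat API docs for the one you use.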

Once you do this, you can (I think) write the model to a file, so you never have to retrain it.
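The save-once, reload-later idea can be sketched with plain Java serialization. This is an illustration only: the "model" below is just a map of per-category word counts standing in for a real trained model, and ModelStoreSketch, save and load are hypothetical names. I believe OpenNLP models have their own serialize(OutputStream) method, but treat that as an assumption and check the API docs for your version.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical helper: persists a trained model so training only has to run once.
// The model is a plain map of word counts, standing in for a real trained model.
class ModelStoreSketch {
    // Write the model to a file using Java's built-in object serialization.
    static void save(HashMap<String, HashMap<String, Integer>> model, Path file)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        }
    }

    // Read a previously saved model back from the file.
    @SuppressWarnings("unchecked")
    static HashMap<String, HashMap<String, Integer>> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, HashMap<String, Integer>>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, HashMap<String, Integer>> model = new HashMap<>();
        model.computeIfAbsent("finance", k -> new HashMap<>()).put("invoice", 3);

        Path file = Files.createTempFile("doccat-model", ".bin");
        save(model, file);
        HashMap<String, HashMap<String, Integer>> restored = load(file);

        System.out.println(restored.equals(model)); // prints true
        Files.delete(file);
    }
}
```

On later runs you would skip training entirely and call load() at startup, retraining only when you add categories or more tagged examples.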