Classification using LingPipe


As part of an academic research project, I am building an application that retrieves a set of URLs from the web. The task is to classify each of these URLs into a category.

For instance, the following URL is about cricket: http://www.espncricinfo.com/icc_cricket_worldcup2011/content/current/story/499851.html. If I give this URL to the classifier, it should output the category "Sports".

For this I am using the LingPipe classifier. I followed the classification tutorial and ran the demo in the demo folder. I downloaded the 20 Newsgroups data set from the following link: http://people.csail.mit.edu/people/jrennie/20Newsgroups

Later, I reduced the training sample size from 20 to 8 and ran the classification demo. It trained on the data and tested it successfully.

My question is: do I need to train the classifier every time I want to find the category of a document? A single run takes 4 minutes, because it both trains and tests on the data.

Can I store the trained data once and perform the classification several times?


Solution

  • You need to serialize the trained model to disk. You can then deserialize it and have the classifier ready to go without retraining.

    Once the classifier is trained, use

    // classifier must implement Compilable; modelFile is a java.io.File
    AbstractExternalizable.compileTo(classifier, modelFile);
    

    to write the model to disk.

    To read it back in, use

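    // readObject returns Object, so cast it to the classifier type you need
    JointClassifier<CharSequence> compiledClassifier = (JointClassifier<CharSequence>) AbstractExternalizable.readObject(modelFile);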
    

    See the Javadoc for AbstractExternalizable for details.

    The model will not be able to accept additional training events because it has been compiled.
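
The train-once, load-many workflow described in this answer can be sketched with plain JDK serialization. This is not LingPipe code: `ModelStore`, `ToyClassifier`, `save`, and `load` are hypothetical names, and the word-count model is only a stand-in for a trained LingPipe classifier. With LingPipe on the classpath, `AbstractExternalizable.compileTo` and `AbstractExternalizable.readObject` would presumably take the place of `save` and `load`.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class ModelStore {

    /** Hypothetical toy model (per-category word counts); a stand-in for a real trained classifier. */
    static class ToyClassifier implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Map<String, Map<String, Integer>> counts = new HashMap<>();

        /** Adds the words of one training text to the counts of its category. */
        void train(String category, String text) {
            Map<String, Integer> words = counts.computeIfAbsent(category, k -> new HashMap<>());
            for (String w : text.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    words.merge(w, 1, Integer::sum);
                }
            }
        }

        /** Returns the category whose training words overlap most with the text, or null if untrained. */
        String classify(String text) {
            String best = null;
            int bestScore = -1;
            for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                int score = 0;
                for (String w : text.toLowerCase().split("\\W+")) {
                    score += e.getValue().getOrDefault(w, 0);
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = e.getKey();
                }
            }
            return best;
        }
    }

    /** Writes the trained model to disk once, after the slow training step. */
    static void save(ToyClassifier model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    /** Reads a previously saved model; no retraining is needed. */
    static ToyClassifier load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (ToyClassifier) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File modelFile = File.createTempFile("toy-model", ".ser");
        modelFile.deleteOnExit();

        // Training run: done once.
        ToyClassifier trained = new ToyClassifier();
        trained.train("Sports", "cricket world cup batsman scores a century");
        trained.train("Politics", "parliament votes on the election bill");
        save(trained, modelFile);

        // Later runs: load the stored model and classify as often as needed.
        ToyClassifier loaded = load(modelFile);
        System.out.println(loaded.classify("cricket batsman"));
    }
}
```

In a real application the training run and the classification runs would be separate programs that share only the model file, so the 4-minute training cost is paid once.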