I am trying to write a map-reduce implementation of a Document Categorizer using OpenNLP.
During the training phase, I plan to read a large number of files and produce a model file as the result of the map-reduce computation (possibly a chain of jobs). Since I will distribute the files across different mappers, this step would produce a number of model files. I now wish to reduce these model files to a single model file to be used for classification.
I understand that this is not the most intuitive of use cases, but I am ready to get my hands dirty and extend/modify the OpenNLP source code, assuming it is possible to tweak the maxent algorithm to work this way.
In case this seems too far-fetched, I would appreciate suggestions for doing this by generating document samples corresponding to the input files as the output of the map-reduce step, and then reducing them to model files by feeding them to the document categorizer trainer.
Thanks!
I've done this before, and my approach was to not have each reducer produce the model, but rather to have the reducers produce only the properly formatted data. Rather than using a category as a key, which splits the data across all the categories, use a single key and make each value the proper format (category, then the sample, then a newline). Then, in the single reducer, you can read that data in as a string via a ByteArrayInputStream and train the model. Of course this is not the only way. You wouldn't have to modify OpenNLP at all to do this.
Simply put, my recommendation is to use a single job that behaves like this:
Map: read in your data and create a (category label, sample) pair. Use a single key called 'ALL' and context.write each pair with that key.
Reduce: use a StringBuilder to concatenate all the category/sample pairs into the proper training format. Convert the string into a ByteArrayInputStream and feed it to the training API. Write the model somewhere. A sketch of both steps follows.
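
Here's a minimal sketch of that job, assuming Hadoop's mapreduce API and the OpenNLP 1.8-style training API. It also assumes each input line already looks like "category<TAB>document text", which is the whitespace-separated format DocumentSampleStream expects; the class names and the output path are placeholders, and the two classes are shown together for brevity but would each live in their own file:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    import opennlp.tools.doccat.DoccatFactory;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.doccat.DocumentSampleStream;
    import opennlp.tools.util.InputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    // Mapper: every sample goes out under the single key "ALL".
    public class DoccatTrainingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text ALL = new Text("ALL");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(ALL, value); // value is "category<whitespace>document text"
        }
    }

    // Reducer: sees every sample, builds the training data in memory, trains once.
    public class DoccatTrainingReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text v : values) {
                sb.append(v.toString()).append('\n');
            }
            byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
            InputStreamFactory isf = () -> new ByteArrayInputStream(data);
            try (ObjectStream<String> lines =
                     new PlainTextByLineStream(isf, StandardCharsets.UTF_8);
                 ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines)) {
                DoccatModel model = DocumentCategorizerME.train(
                        "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());
                FileSystem fs = FileSystem.get(context.getConfiguration());
                try (OutputStream out = fs.create(new Path("doccat.bin"))) { // placeholder path
                    model.serialize(out);
                }
            }
        }
    }

In the driver, set job.setNumReduceTasks(1) so everything funnels through a single reducer.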
One problem that may occur is that your sample data is too huge to send to one node. If so, you can write the values to a NoSQL DB and read them in from a beefier training node. Or you can use randomization in your mapper to produce many keys and build many models, then at classification time write a wrapper that tests data across them all and gets the best from each one (sketched below). Lots of options. HTH
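
For that many-models variant, the wrapper could look something like this. It's a rough sketch: the class name is hypothetical, the highest-probability-wins rule is just one way to combine the votes, and it assumes a recent OpenNLP where categorize takes a token array:

    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;

    // Hypothetical wrapper: each model proposes its most confident category,
    // and the proposal with the highest probability wins.
    public class EnsembleCategorizer {
        private final List<DocumentCategorizerME> categorizers = new ArrayList<>();

        public EnsembleCategorizer(List<DoccatModel> models) {
            for (DoccatModel m : models) {
                categorizers.add(new DocumentCategorizerME(m));
            }
        }

        public String categorize(String text) {
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(text);
            String bestCategory = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (DocumentCategorizerME c : categorizers) {
                double[] probs = c.categorize(tokens);
                String cat = c.getBestCategory(probs);
                double score = probs[c.getIndex(cat)];
                if (score > bestScore) {
                    bestScore = score;
                    bestCategory = cat;
                }
            }
            return bestCategory;
        }
    }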