I want to train the Stanford tagger using a corpus which consists of multiple files and will be extended in the future.
Is it possible to update an existing model, or do I have to train on the entire corpus every time?
Are there any examples of how to do the training using the API? The JavaDoc of MaxentTagger only covers training via the command line.
Thank you!
At present, you have to train using the entire corpus every time. (Updating a model with additional data is theoretically possible, but it's not something that currently exists and it isn't on our front burner.)
We do all our model training from the command line. Actually, looking at the code, the train method is private, so you'd need to make it more accessible before you could train through the API. We should fix that, and I might try to do so.
If the access level were different, you could create a TaggerConfig and then call this method:
private static void trainAndSaveModel(TaggerConfig config) throws IOException { ... }
But even then, it currently always saves the tagger it builds to disk, so things would need a bit of reworking to make this work smoothly.
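Until then, one workaround for a corpus that keeps growing is to re-merge all the files and rerun the command-line trainer each time. Below is a rough sketch of that idea, not something from the tagger distribution: the class FullRetrain, its two helper methods, and the jar path are made up for illustration. It assumes the corpus files are already in the tagged format the trainer expects, and that -model and -trainFile are enough on the command line. In practice you would probably also pass a properties file with the feature architecture, tag separator, and so on.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: rebuilds one training file from a growing corpus
// directory and assembles the command line for a full retrain.
class FullRetrain {

    // Concatenates every regular file in corpusDir, in file-name order,
    // into mergedFile. Keep mergedFile outside corpusDir, otherwise it
    // would be picked up as corpus data on the next run.
    static Path mergeCorpus(Path corpusDir, Path mergedFile) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (Stream<Path> entries = Files.list(corpusDir)) {
            entries.filter(Files::isRegularFile).forEach(parts::add);
        }
        Collections.sort(parts);

        List<String> lines = new ArrayList<>();
        for (Path part : parts) {
            lines.addAll(Files.readAllLines(part, StandardCharsets.UTF_8));
        }
        Files.write(mergedFile, lines, StandardCharsets.UTF_8);
        return mergedFile;
    }

    // Builds the trainer invocation. The jar location is an assumption;
    // extra training options are deliberately left out of this sketch.
    static List<String> trainCommand(String taggerJar, Path trainFile, Path modelFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(taggerJar);
        cmd.add("edu.stanford.nlp.tagger.maxent.MaxentTagger");
        cmd.add("-model");
        cmd.add(modelFile.toString());
        cmd.add("-trainFile");
        cmd.add(trainFile.toString());
        return cmd;
    }

    // Arguments: corpusDir mergedFile taggerJar modelFile
    public static void main(String[] args) throws Exception {
        Path merged = mergeCorpus(Paths.get(args[0]), Paths.get(args[1]));
        List<String> cmd = trainCommand(args[2], merged, Paths.get(args[3]));
        // Print the command instead of running it; to actually train, use:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println(String.join(" ", cmd));
    }
}
```

Whenever new files are added to the corpus directory, rerunning this rebuilds the merged file from scratch and retrains on everything, which matches the "entire corpus every time" limitation described above.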