
How to handle LemmatizerTrainer 'UTFDataFormatException: encoded string too long'?


I am using OpenNLP to train a lemmatization model for German words. To do so, I use the OpenNLP CLI and the UD_German-HDT training set.

The training itself works fine (it just needs a fair amount of RAM), but the CLI fails to write the model because of a UTFDataFormatException: encoded string too long.

The CLI command I am using:

        opennlp LemmatizerTrainerME.conllu -params params.txt -lang de -model de-lemmatizer.bin -data UD_German-HDT/de_hdt-ud-train.conllu -encoding UTF-8

Stacktrace:

Writing lemmatizer model ... failed
Error during writing model file 'de-lemmatizer.bin'
encoded string too long: 383769 bytes
java.io.UTFDataFormatException: encoded string too long: 383769 bytes
        at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
        at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
        at opennlp.tools.ml.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:71)
        at opennlp.tools.ml.maxent.io.GISModelWriter.persist(GISModelWriter.java:97)
        at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
        at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
        at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
        at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
        at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
        at opennlp.tools.cmdline.CmdLineUtil.writeModel(CmdLineUtil.java:182)
        at opennlp.tools.cmdline.lemmatizer.LemmatizerTrainerTool.run(LemmatizerTrainerTool.java:77)
        at opennlp.tools.cmdline.CLI.main(CLI.java:256)

Has anybody encountered this problem and found a solution?


Solution

  • I recently wrote a patch that fixes OpenNLP-1366. The related PR https://github.com/apache/opennlp/pull/427 documents the problem and the solution in detail.

    The upcoming OpenNLP version 2.0.1 will include the fix for the problem reported in the question. Updating to that version will resolve the crash that occurs while writing trained model files.

    Note:
    I verified that the patch works with UD_German-HDT, UD_German-GSD, and other treebanks for the German language.
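
For background: the exception in the stack trace is thrown by java.io.DataOutputStream.writeUTF, which stores the encoded length in two bytes and therefore rejects any string whose modified UTF-8 encoding is longer than 65535 bytes. The sketch below reproduces that limit in isolation and shows one possible way around it, namely an int length prefix followed by the raw UTF-8 bytes. The class and method names are hypothetical, and this is not necessarily how the OpenNLP patch handles it; the PR linked above describes the actual change.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of OpenNLP.
public class WriteUtfLimitDemo {

    // Builds a string consisting of n copies of the character c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Returns true if DataOutputStream.writeUTF rejects the string
    // because its encoded form exceeds 65535 bytes.
    static boolean writeUtfFails(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed workaround for illustration only: write a 4-byte length
    // followed by the UTF-8 bytes, which avoids the 2-byte length field.
    static byte[] writeLongString(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String atLimit = repeat('a', 65_535);   // largest size writeUTF accepts
        String tooLong = repeat('a', 383_769);  // size reported in the stack trace
        System.out.println("writeUTF fails at 65535 bytes:  " + writeUtfFails(atLimit));
        System.out.println("writeUTF fails at 383769 bytes: " + writeUtfFails(tooLong));
        System.out.println("length-prefixed size in bytes:  " + writeLongString(tooLong).length);
    }
}
```

A reader of data written this way would have to mirror the format with readInt followed by readFully, so it is not interchangeable with files written via writeUTF.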