I’m using the CoreNLP tools from the command line to tag some files containing German text. I need the token, POS, lemma and NER annotations. For this purpose I’m using the following command:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -filelist $dir/filelist.input -outputFormat conll --add-modules java.se.ee -ner.useSUTime 0 -outputFormatOptions word,pos,lemma,ner -outputDirectory $dir/tagged_articles -replaceExtension -props StanfordCoreNLP-german.properties
However, the lemmas I’m getting are just not right. Here is an example of a tagged file:
Auch ADV auch O
eine ART eine O
ausgereifte ADJA ausgereifte O
Technik NN technik O
kann VMFIN kann O
jedoch ADV jedoch O
an APPR a O
ihre PPOSAT ihre O
Grenzen NN grenzen O
stoßen VVINF stoßen O
For some of the words in the file, the lemmas should be: ist -> sein / Textmengen -> Textmenge / enormen -> enorm / Grenzen -> Grenze. So something is obviously wrong, and I’m wondering what it could be. Any hint is highly appreciated!
I am using the following German model: stanford-german-corenlp-2018-02-27-models.jar
According to the README file, the version of the CoreNLP tools is "2018-02-27 3.9.1"
java version "10.0.1" 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
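As of now, CoreNLP supports lemmatization only for English. The lemma annotator applies English morphology rules regardless of the language, which would explain outputs such as an -> a and the lowercased nouns in your file.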
You could try using a different lemmatizer and add the lemmas manually.
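As a rough sketch of what "adding the lemmas manually" could look like, the snippet below rewrites the lemma column of the word/pos/lemma/ner output. It assumes each token line has exactly four whitespace-separated columns, as in the example from the question. LEMMA_LOOKUP, lemmatize and fix_lemmas are hypothetical names; the lookup table only contains the examples from the question and stands in for whatever German lemmatizer you end up using.

```python
# Hypothetical lookup table standing in for a real German lemmatizer;
# the entries are just the examples given in the question.
LEMMA_LOOKUP = {
    "ist": "sein",
    "Textmengen": "Textmenge",
    "enormen": "enorm",
    "Grenzen": "Grenze",
}


def lemmatize(word, pos):
    """Placeholder: return a lemma for the word, falling back to the word itself.

    The POS tag is passed in because a real lemmatizer would likely use it;
    this stand-in ignores it.
    """
    return LEMMA_LOOKUP.get(word, word)


def fix_lemmas(conll_text):
    """Replace the third column (lemma) of word/pos/lemma/ner rows."""
    out = []
    for line in conll_text.splitlines():
        cols = line.split()
        if len(cols) != 4:
            # Keep blank sentence separators and unexpected lines unchanged.
            out.append(line)
            continue
        word, pos, _old_lemma, ner = cols
        out.append("\t".join([word, pos, lemmatize(word, pos), ner]))
    return "\n".join(out)
```

You would run fix_lemmas over the contents of each file in the output directory and write the result back, swapping the body of lemmatize for a call to the lemmatizer of your choice.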