machine-learning, nlp, linguistics

Lemmatizer supporting the German language (for commercial and research purposes)


I am looking for lemmatization software that:

  • supports the German language
  • has a license that allows both commercial and research use. The LGPL would be good.
  • is preferably implemented in Java. Implementations in other programming languages would also be OK.

Does anybody know about such a lemmatizer?

Regards,

UPDATE: Hi Daniel, first of all, thank you for the great work you are doing with LanguageTool.

We would like to index German texts into Elasticsearch (ES) and pre-analyze them using either an ES built-in German stemmer (see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html) or the plugin at https://github.com/jprante/elasticsearch-analysis-baseform. The latter uses your morphology file at http://www.danielnaber.de/morphologie/morphy-mapping-20110717.latin1.gz, which is why I thought you might have evaluation data showing the trade-off between lemmatization based on your morphology file and an ES built-in stemmer. Do you have any figures on the precision or coverage of your German morphology, or comparative data against the German stemmers used in Elasticsearch?

Best regards


Solution

  • LanguageTool can do that (disclaimer: I'm the maintainer of LanguageTool). It is available under the LGPL and implemented in Java. You could use GermanTagger.tag(). The result can contain more than one reading per token (as language is often ambiguous), and each reading is an AnalyzedToken that carries a lemma.
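
To illustrate why one token can yield several lemmas, here is a minimal, self-contained sketch. It does not call LanguageTool: `Reading`, `LEXICON`, `readingsOf` and `lemmasOf` are hypothetical stand-ins that only mirror the shape described above (a token has one or more readings, and each reading has a lemma). In real code the readings would come from GermanTagger.tag() instead of the toy lexicon. Keeping an unknown word unchanged is also an assumption of this sketch, not necessarily what LanguageTool does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LemmaSketch {

    /** Stand-in for one analysis of a token; NOT LanguageTool's AnalyzedToken. */
    static final class Reading {
        final String lemma;
        final String tag; // illustrative label, not LanguageTool's tag set

        Reading(String lemma, String tag) {
            this.lemma = lemma;
            this.tag = tag;
        }
    }

    // Hypothetical toy lexicon; a real tagger would supply these readings.
    static final Map<String, List<Reading>> LEXICON = new HashMap<>();
    static {
        LEXICON.put("ging", Arrays.asList(
                new Reading("gehen", "verb, past tense")));
        // Ambiguous form: participle of "fallen" or infinitive of "gefallen".
        LEXICON.put("gefallen", Arrays.asList(
                new Reading("fallen", "verb, past participle"),
                new Reading("gefallen", "verb, infinitive")));
    }

    /** Returns all readings of a token, or an empty list if it is unknown. */
    static List<Reading> readingsOf(String token) {
        List<Reading> readings = LEXICON.get(token);
        return readings != null ? readings : Collections.<Reading>emptyList();
    }

    /** Collects the distinct lemmas of a token, preserving reading order. */
    static List<String> lemmasOf(String token) {
        Set<String> lemmas = new LinkedHashSet<>();
        for (Reading reading : readingsOf(token)) {
            lemmas.add(reading.lemma);
        }
        if (lemmas.isEmpty()) {
            lemmas.add(token); // assumption: keep unknown words unchanged
        }
        return new ArrayList<>(lemmas);
    }

    public static void main(String[] args) {
        for (String token : Arrays.asList("ging", "gefallen", "xyz")) {
            System.out.println(token + " -> " + lemmasOf(token));
        }
    }
}
```

For indexing, you would then have to decide whether to index every lemma of an ambiguous token or to pick one. The sketch keeps all of them.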