
Surprising results for German lemmatization in spaCy


I wanted to use the German lemmatizer in spaCy, but I am very surprised by the results:

import spacy

# German transformer pipeline (spaCy v3)
nlp = spacy.load("de_dep_news_trf")
print([token.lemma_ for token in nlp("ich du er sie mein dein sein ihr unser")])

gives

['ich', 'du', 'ich', 'ich', 'meinen', 'mein', 'mein', 'mein', 'sich']

and I am not sure I can use that. For example, the sentence

vielen dank für deinen sehr guten tweet ("many thanks for your very good tweet")

becomes

viel danken für mein sehr gut tweet (roughly "much to thank for my very good tweet")

which clearly changes the meaning of the sentence: "deinen" ("your") has become "mein" ("my").

Is this expected? Am I missing some tuning or configuration option that would make the lemmatizer less "aggressive"?


Solution

  • The current (v3.1) default German lemmatizer is just not very good. It's a very simple lookup lemmatizer with some questionable entries in its lookup table (you can inspect it yourself; see the first sketch below), but given the license constraints for the German pretrained pipelines, there haven't been any good alternatives. (We do have some internal work in progress on a statistical lemmatizer, but I'm not sure when it will make it into a release.)

    If lemmas are important for your task, the best suggestion is to use a different lemmatizer. Depending on your task, size, speed, and license requirements, you could consider using a German model from spacy-stanza (see the second sketch below) or a third-party library like spacy-iwnlp (currently only for spaCy v2, but it's probably not hard to update it for v3).
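
    As a quick sanity check of the point above, here is a minimal sketch that inspects the lemmatizer component of the pipeline from the question. It assumes spaCy v3.1 and the de_dep_news_trf pipeline; the "lemma_lookup" table name follows spaCy's lookup mode, and the "sie" entry is chosen to match the output in the question, so both may differ in other versions:

    import spacy

    nlp = spacy.load("de_dep_news_trf")
    lemmatizer = nlp.get_pipe("lemmatizer")

    # The German pipelines ship a plain lookup lemmatizer: no rules, no statistics.
    print(lemmatizer.mode)  # expected: "lookup"

    # Peek at the raw table entry behind the surprising "sie" -> "ich" lemma.
    table = lemmatizer.lookups.get_table("lemma_lookup")
    print(table.get("sie"))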
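
    For the spacy-stanza route, here is a minimal sketch. It assumes the spaCy v3 compatible spacy-stanza release and the stanza package are installed; the Stanza German models are downloaded once and cached:

    import stanza
    import spacy_stanza

    # Download the Stanza German models (cached after the first run).
    stanza.download("de")

    # Build a spaCy pipeline whose tagging and lemmatization are handled by Stanza.
    nlp = spacy_stanza.load_pipeline("de")

    doc = nlp("vielen dank für deinen sehr guten tweet")
    print([token.lemma_ for token in doc])

    Stanza's German lemmatizer is statistical rather than lookup-based, so it should lemmatize "deinen" to "dein" instead of mapping it to "mein", though the exact lemmas depend on the model version.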