python nlp spacy lemmatization

How to lemmatize Norwegian using spaCy?


I'm doing the following:

from spacy.lang.nb import Norwegian
nlp = Norwegian()
doc = nlp(u'Jeg heter Marianne Borgen og jeg er ordføreren i Oslo.')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Lemmatization does not seem to work at all; this is the output:

(u'Jeg', u'Jeg', u'', u'', u'', u'Xxx', True, False)
(u'heter', u'heter', u'', u'', u'', u'xxxx', True, False)
(u'Marianne', u'Marianne', u'', u'', u'', u'Xxxxx', True, False)
(u'Borgen', u'Borgen', u'', u'', u'', u'Xxxxx', True, False)
(u'og', u'og', u'', u'', u'', u'xx', True, True)
(u'jeg', u'jeg', u'', u'', u'', u'xxx', True, True)
(u'er', u'er', u'', u'', u'', u'xx', True, True)
(u'ordf\xf8reren', u'ordf\xf8reren', u'', u'', u'', u'xxxx', True, False)
(u'i', u'i', u'', u'', u'', u'x', True, True)
(u'Oslo', u'Oslo', u'', u'', u'', u'Xxxx', True, False)
(u'.', u'.', u'', u'', u'', u'.', False, False)

However, looking at https://github.com/explosion/spaCy/blob/master/spacy/lang/nb/lemmatizer/_verbs_wordforms.py, the verb heter should at least be transformed into hete.

So it looks like spaCy has support, but it's not working? What could be the problem?


Solution

  • Lemmatization does in fact work for Norwegian, as specified in the docs: every form listed in lookup.py is lemmatized. For instance, try doc = nlp(u'ei') and you'll see that the lemma of ei is en.

    The file you are referring to, _verbs_wordforms.py, lists exceptions that apply only when the part-of-speech (POS) tag is a verb. However, the blank model Norwegian() has no POS tagger, so the token never gets a verb tag and the exception for heter is never triggered.

    So the solution is either to use a model that has a POS tagger, or to add your specific exceptions to lookup.py. For instance, if you add the line 'heter': 'hete' there, your blank model will return hete as the lemma for heter.

    Finally, note that there has been a lot of work and discussion around publishing a pre-trained Norwegian model for spaCy, but that still appears to be a work in progress.
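To make the distinction in the answer above concrete, here is a minimal sketch in plain Python of the two lemmatization paths: a POS-independent lookup table and POS-dependent exceptions. It does not use spaCy and is not spaCy's actual implementation. The names LOOKUP, VERB_EXCEPTIONS and lemmatize are hypothetical, and the tables contain only the two entries mentioned in this thread.

```python
# Toy model only: hypothetical names, not spaCy's real code or data.

# POS-independent table (plays the role of lookup.py).
LOOKUP = {"ei": "en"}

# Exceptions that apply only to tokens tagged as verbs
# (plays the role of _verbs_wordforms.py).
VERB_EXCEPTIONS = {"heter": "hete"}


def lemmatize(word, pos=None):
    """Return a lemma for `word`; `pos` is None when no tagger has run."""
    # The verb exceptions are consulted only if a tagger supplied a verb tag.
    if pos == "VERB" and word in VERB_EXCEPTIONS:
        return VERB_EXCEPTIONS[word]
    # Otherwise fall back to the lookup table, then to the word itself.
    return LOOKUP.get(word, word)


print(lemmatize("ei"))             # found in the lookup table
print(lemmatize("heter"))          # no POS tag, so the word is returned unchanged
print(lemmatize("heter", "VERB"))  # verb exception applies

# Adding the form to the lookup table makes it work without a tagger.
LOOKUP["heter"] = "hete"
print(lemmatize("heter"))
```

With a blank model the pos argument is effectively always empty, which is why only entries in the lookup table have any effect until a tagger is added.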