I'm doing the following:
from spacy.lang.nb import Norwegian
nlp = Norwegian()
doc = nlp(u'Jeg heter Marianne Borgen og jeg er ordføreren i Oslo.')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
Lemmatization doesn't seem to work at all; this is the output:
(u'Jeg', u'Jeg', u'', u'', u'', u'Xxx', True, False)
(u'heter', u'heter', u'', u'', u'', u'xxxx', True, False)
(u'Marianne', u'Marianne', u'', u'', u'', u'Xxxxx', True, False)
(u'Borgen', u'Borgen', u'', u'', u'', u'Xxxxx', True, False)
(u'og', u'og', u'', u'', u'', u'xx', True, True)
(u'jeg', u'jeg', u'', u'', u'', u'xxx', True, True)
(u'er', u'er', u'', u'', u'', u'xx', True, True)
(u'ordf\xf8reren', u'ordf\xf8reren', u'', u'', u'', u'xxxx', True, False)
(u'i', u'i', u'', u'', u'', u'x', True, True)
(u'Oslo', u'Oslo', u'', u'', u'', u'Xxxx', True, False)
(u'.', u'.', u'', u'', u'', u'.', False, False)
However, according to https://github.com/explosion/spaCy/blob/master/spacy/lang/nb/lemmatizer/_verbs_wordforms.py, the verb heter should at least be lemmatized to hete.
So it looks like spaCy has support for Norwegian lemmatization, but it's not working. What could be the problem?
Lemmatization does in fact work for Norwegian, as specified in the docs: all forms in lookup.py are lemmatized. Try for instance doc = nlp(u'ei') and you'll see that the lemma of ei is en.
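A quick way to confirm this with a blank model (the expected lemma is the one the lookup table gives, as described above):

from spacy.lang.nb import Norwegian

nlp = Norwegian()
doc = nlp(u'ei')
# 'ei' is in the lookup table, so even the blank pipeline returns 'en'
print(doc[0].lemma_)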
Now, the file you are referring to, _verbs_wordforms.py, lists exceptions that only apply when the part-of-speech (POS) tag is VERB. However, the blank model Norwegian() does not have a POS tagger, so that particular exception for heter is never triggered.
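You can check this directly on the sentence from the question: with no tagger in the blank pipeline, pos_ stays empty, so the verb-specific wordforms are never consulted (a small sketch, matching the output shown above):

from spacy.lang.nb import Norwegian

nlp = Norwegian()
doc = nlp(u'Jeg heter Marianne Borgen og jeg er ordføreren i Oslo.')
token = doc[1]  # 'heter'
# pos_ is empty because the blank model has no tagger, so the VERB
# exception 'heter' -> 'hete' from _verbs_wordforms.py never applies
print(token.pos_, token.lemma_)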
So the solution is either to use a model that has a POS tagger, or to add your specific exceptions to lookup.py. For instance, if you add the line 'heter': 'hete', there, your blank model will find hete as the lemma for heter.
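A minimal sketch of that second option, assuming the table in lookup.py is a plain Python dict (the exact variable name may differ between spaCy versions):

# in spacy/lang/nb/lemmatizer/lookup.py (variable name assumed)
LOOKUP = {
    # ... existing entries ...
    'heter': 'hete',
}

After adding that entry, re-running the blank model from the question should give hete as the lemma for heter, even without a tagger.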
Finally, note that there has been a lot of work and discussion around publishing a pre-trained Norwegian model for spaCy, but it looks like that is still a work in progress.