Switch spaCy lemmatizer's mode for the French language


With spaCy, I want to switch the lemmatizer of the French model from its default rule-based mode to 'lookup'.

I'm using spaCy 3.6.1, the fr_core_news_lg 3.6.0 model, and spacy-lookups-data 1.0.5.

This seemed to be the only way to do so:

_, _ = nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()

But I've found that it breaks my Matcher. Here is a short snippet that reproduces the problem.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('fr_core_news_lg')

# _, _ = nlp.remove_pipe("lemmatizer")
# nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
# nlp.initialize()

matcher = Matcher(nlp.vocab)

n = [{"POS": "NOUN"}]
matcher.add("NOUN", [n])

matcher(nlp('Cheval'))

=> Works, the output is something like [(92, 0, 1)]

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('fr_core_news_lg')

_, _ = nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()

matcher = Matcher(nlp.vocab)

n = [{"POS": "NOUN"}]
matcher.add("NOUN", [n])

matcher(nlp('Cheval'))

=> Doesn't work: the output is [] and I get this warning:
UserWarning: [W036] The component 'matcher' does not have any patterns defined. matches = self.matcher(doc, allow_missing=True, as_spans=False)

It looks like removing the lemmatizer also removes other components, because with the English model I get this error instead:
ValueError: [E155] The pipeline needs to include a morphologizer or tagger+attribute_ruler in order to use Matcher or PhraseMatcher with the attribute


Solution

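  • nlp.initialize() re-initializes every component in the pipeline, not just the new lemmatizer. It wipes out the trained weights and settings of all components (as if you wanted to start over and train the whole pipeline from scratch), so the components that assign POS no longer produce usable output and the Matcher pattern on POS finds nothing. The W036 warning appears to come from the attribute_ruler, whose patterns were cleared in the process.

    Instead, initialize only the new lemmatizer, which loads its lookup tables: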

    nlp.remove_pipe("lemmatizer")
    lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
    lemmatizer.initialize()
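
The fix above can be wrapped in a small helper so the swap is done the same way every time. This is only a sketch: `switch_lemmatizer_mode` is a hypothetical name (not part of spaCy), it assumes the component is registered under the default name "lemmatizer", and it relies only on the `remove_pipe`, `add_pipe` and component-level `initialize` calls already shown. Note that `add_pipe` appends the new lemmatizer at the end of the pipeline, which should be fine for 'lookup' mode.

```python
def switch_lemmatizer_mode(nlp, mode="lookup"):
    """Swap the pipeline's lemmatizer for one in `mode`.

    Hypothetical helper: `nlp` is expected to be a loaded spaCy pipeline
    whose lemmatizer component is named "lemmatizer" (an assumption).
    """
    # Drop the existing lemmatizer, if there is one.
    if "lemmatizer" in nlp.pipe_names:
        nlp.remove_pipe("lemmatizer")

    # Add a fresh lemmatizer; it is appended at the end of the pipeline.
    lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": mode})

    # Initialize ONLY the new component so it loads its tables.
    # Do not call nlp.initialize() here: that would reset every
    # other component in the pipeline as well.
    lemmatizer.initialize()
    return lemmatizer
```

With the pipeline from the question, calling switch_lemmatizer_mode(nlp) right after spacy.load('fr_core_news_lg') and then running the Matcher example should leave the POS-based pattern working, since the other components are never re-initialized.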