Tags: spacy, spacy-3

Why doesn't the spaCy morphologizer work when we use a custom tokenizer?


I don't understand why, when I do this:

import spacy
from copy import deepcopy
nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)
    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")

the tokens don't have any morphological analysis assigned:

print([tok.morph for tok in doc])
> ['','','','','']

Is this behavior expected? If so, why? (spaCy v3.0.7)


Solution

  • The pipeline expects nlp.vocab and nlp.tokenizer.vocab to refer to the exact same Vocab object, which isn't the case after running deepcopy.

    I admit that I'm not entirely sure off the top of my head why you end up with empty analyses rather than more specific errors, but I believe the MorphAnalysis objects, which are stored centrally in the vocab under vocab.morphology, end up out of sync between the two vocabs.
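A minimal sketch of the fix is to keep a reference to the original tokenizer instead of deep-copying it, so the wrapper's tokenizer keeps sharing nlp.vocab. The sketch below uses spacy.blank("fr") only so it runs without a downloaded model; the same identity rule applies to fr_core_news_lg:

```python
import spacy
from copy import deepcopy

# Blank pipeline used only so this runs without a downloaded model;
# the same Vocab-identity rule applies to fr_core_news_lg.
nlp = spacy.blank("fr")

# deepcopy gives the copied tokenizer its own Vocab, distinct from nlp.vocab,
# which is exactly the mismatch described above:
copied = deepcopy(nlp.tokenizer)
print(copied.vocab is nlp.vocab)  # False

class MyTokenizer:
    def __init__(self, tokenizer):
        # Keep a reference to the original tokenizer instead of a deepcopy,
        # so it continues to share the pipeline's Vocab.
        self.tokenizer = tokenizer

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
print(nlp.tokenizer.tokenizer.vocab is nlp.vocab)  # True
```

With the shared vocab restored, the morphologizer in a trained pipeline like fr_core_news_lg should assign non-empty tok.morph values again.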