I don't understand why, when I do this:
```python
import spacy
from copy import deepcopy

nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")
```
the tokens don't have any morph assigned:

```python
print([tok.morph for tok in doc])
# ['', '', '', '', '']
```
Is this behavior expected? If yes, why? (spaCy v3.0.7)
The pipeline expects `nlp.vocab` and `nlp.tokenizer.vocab` to refer to the exact same `Vocab` object, which isn't the case after running `deepcopy`.
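The identity break can be seen without spaCy at all; this is a minimal sketch using stand-in `Vocab`/`Tokenizer` classes (not the real spaCy types) to show how `deepcopy` silently replaces a shared object with a fresh copy:

```python
from copy import deepcopy

class Vocab:  # stand-in for spaCy's Vocab
    pass

class Tokenizer:  # stand-in tokenizer that holds a vocab reference
    def __init__(self, vocab):
        self.vocab = vocab

shared = Vocab()
tok = Tokenizer(shared)
copied = deepcopy(tok)

print(tok.vocab is shared)     # True: the original still shares the vocab
print(copied.vocab is shared)  # False: deepcopy created a new Vocab object
```

In the real pipeline the same thing happens to `nlp.tokenizer.vocab`, so the components and the tokenizer end up writing to two different vocabs.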
I admit that I'm not entirely sure off the top of my head why you end up with empty analyses instead of more specific errors, but I think the `MorphAnalysis` objects, which are stored centrally in the vocab in `vocab.morphology`, end up out of sync between the two vocabs.
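One way to avoid the problem (a sketch, not an officially documented pattern): have the wrapper keep a reference to the original tokenizer instead of deep-copying it, so the shared `Vocab` is preserved:

```python
class MyTokenizer:
    def __init__(self, tokenizer):
        # Keep a reference rather than deepcopy(tokenizer), so
        # self.tokenizer.vocab stays the same object as nlp.vocab.
        self.tokenizer = tokenizer

    def __call__(self, text):
        return self.tokenizer(text)
```

With this version, `nlp.tokenizer = MyTokenizer(nlp.tokenizer)` leaves `nlp.vocab` and the wrapped tokenizer's vocab identical, so morph analyses should be assigned normally; it only works if the wrapper genuinely doesn't need an independent copy of the tokenizer.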