python, python-3.x, spacy, spacy-3

spaCy default English tokenizer changes behaviour when re-assigned


When you assign spaCy's (v3.0.5) English language model en_core_web_sm its own default tokenizer, its tokenization behaviour changes.

You would expect no change, but it silently fails. Why is this?

Code to reproduce:

import spacy

text = "don't you're i'm we're he's"

# No tokenizer assignment, everything is fine
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
[t.lemma_ for t in doc]
>>> ['do', "n't", 'you', 'be', 'I', 'be', 'we', 'be', 'he', 'be']

# Default Tokenizer assignment: tokenization and therefore lemmatization fails
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
doc = nlp(text)
[t.lemma_ for t in doc]
>>> ["don't", "you're", "i'm", "we're", "he's"]

Solution

  • To create a true default tokenizer it is necessary to pass all the language defaults to the Tokenizer class, not just the vocab. A bare Tokenizer(nlp.vocab) has no exception rules and no prefix/suffix/infix patterns, so it only splits on whitespace (a quick verification follows the snippet below):

    from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex
    
    # Rebuild the tokenizer from the English language defaults
    rules = nlp.Defaults.tokenizer_exceptions
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    infix_re = compile_infix_regex(nlp.Defaults.infixes)
    
    tokenizer = spacy.tokenizer.Tokenizer(
        nlp.vocab,
        rules=rules,
        prefix_search=prefix_re.search,
        suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
    )
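
  • Re-assigning this rebuilt tokenizer should reproduce the original output (a quick check; the expected lemmas are the ones from the first run above). Depending on the spaCy version, the defaults may also define token_match and url_match patterns that can be passed the same way, although they make no difference for this example:

    nlp.tokenizer = tokenizer
    doc = nlp(text)
    [t.lemma_ for t in doc]
    >>> ['do', "n't", 'you', 'be', 'I', 'be', 'we', 'be', 'he', 'be']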