Tags: python, nlp, tokenize, spacy

How can I add a specific substring to tokenize on in spaCy?


I am using spaCy to tokenize a string, and the string is likely to contain a specific substring. If the substring is present, I would like spaCy to treat the substring as a token, regardless of any other rules it has. I would like to keep all other rules intact. Is this possible?

To provide a concrete example, suppose the substring of interest is 'banana'; I want 'I like bananabread.' to be tokenized as ['I', 'like', 'banana', 'bread', '.'].

Where do I go from here (keeping in mind that I would like to keep the rest of the tokenizer rules intact)? I have tried adding 'banana' to the prefixes, suffixes, and infixes, with no success.


Solution

  • Adding the string as a prefix, suffix, and infix should work, but depending on which version of spaCy you're using, you may have run into a caching bug while testing. This bug is fixed in v2.2+.

    With spacy v2.3.2:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    
    # By default, "bananabread" stays a single token.
    text = "I like bananabread."
    assert [t.text for t in nlp(text)] == ['I', 'like', 'bananabread', '.']
    
    # Prepend "banana" to the default prefix, suffix, and infix patterns
    # so it is split off wherever it occurs within a token.
    prefixes = ("banana",) + nlp.Defaults.prefixes
    suffixes = ("banana",) + nlp.Defaults.suffixes
    infixes = ("banana",) + nlp.Defaults.infixes
    
    # Recompile the combined patterns into the regexes the tokenizer uses.
    prefix_regex = spacy.util.compile_prefix_regex(prefixes)
    suffix_regex = spacy.util.compile_suffix_regex(suffixes)
    infix_regex = spacy.util.compile_infix_regex(infixes)
    
    # Overwrite the tokenizer's search callables in place.
    nlp.tokenizer.prefix_search = prefix_regex.search
    nlp.tokenizer.suffix_search = suffix_regex.search
    nlp.tokenizer.infix_finditer = infix_regex.finditer
    
    assert [t.text for t in nlp(text)] == ['I', 'like', 'banana', 'bread', '.']
    
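    Note that the prefix, suffix, and infix entries are treated as regular expression patterns, so if your substring contains regex metacharacters, escape it first. A minimal sketch, using a hypothetical substring "u.s.":

    import re
    
    substring = "u.s."  # hypothetical example containing regex metacharacters
    prefixes = (re.escape(substring),) + nlp.Defaults.prefixes
    # do the same for suffixes and infixes before recompiling
    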

    (In v2.1 and earlier, the same customization works on a freshly loaded nlp, but if you have already processed texts with the pipeline and then modify the settings, the bug causes the tokenizer to reuse cached tokenizations rather than apply the new settings.)
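
    If you are stuck on v2.1 or earlier, one workaround is to replace the tokenizer wholesale so that no stale cache entries are reused. This is an untested sketch, assuming spaCy v2.x and the prefix_regex, suffix_regex, and infix_regex objects compiled above:

    from spacy.tokenizer import Tokenizer
    
    # A brand-new Tokenizer starts with an empty cache, sidestepping the bug.
    nlp.tokenizer = Tokenizer(
        nlp.vocab,
        rules=nlp.Defaults.tokenizer_exceptions,
        prefix_search=prefix_regex.search,
        suffix_search=suffix_regex.search,
        infix_finditer=infix_regex.finditer,
        token_match=nlp.Defaults.token_match,
    )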