Tags: python, nlp, tokenize, spacy

How can I add a specific substring to tokenize on in spaCy?


I am using spaCy to tokenize a string, and the string is likely to contain a specific substring. If the substring is present, I would like spaCy to treat the substring as a token, regardless of any other rules it has. I would like to keep all other rules intact. Is this possible?

To provide a concrete example, suppose the substring of interest is 'banana'; I want 'I like bananabread.' to be tokenized as ['I', 'like', 'banana', 'bread', '.'].

Where do I go from here (keeping in mind that I would like to keep the rest of the tokenizer rules intact)? I have tried adding 'banana' to the prefixes, suffixes, and infixes, with no success.


Solution

  • Adding the string as a prefix, suffix, and infix should work, but depending on which version of spaCy you're using, you may have run into a caching bug while testing. This bug is fixed in v2.2+.

    With spacy v2.3.2:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    
    # By default, "bananabread" stays a single token.
    text = "I like bananabread."
    assert [t.text for t in nlp(text)] == ['I', 'like', 'bananabread', '.']
    
    # Prepend "banana" to the default prefix, suffix, and infix patterns
    # so it is split off wherever it occurs within a token.
    prefixes = ("banana",) + nlp.Defaults.prefixes
    suffixes = ("banana",) + nlp.Defaults.suffixes
    infixes = ("banana",) + nlp.Defaults.infixes
    
    # Recompile the combined patterns into the regexes the tokenizer uses.
    prefix_regex = spacy.util.compile_prefix_regex(prefixes)
    suffix_regex = spacy.util.compile_suffix_regex(suffixes)
    infix_regex = spacy.util.compile_infix_regex(infixes)
    
    # Overwrite the tokenizer's search callables in place.
    nlp.tokenizer.prefix_search = prefix_regex.search
    nlp.tokenizer.suffix_search = suffix_regex.search
    nlp.tokenizer.infix_finditer = infix_regex.finditer
    
    assert [t.text for t in nlp(text)] == ['I', 'like', 'banana', 'bread', '.']
    
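    Note that the prefix, suffix, and infix entries are treated as regular expression patterns, so if your substring contains regex metacharacters, escape it first. A minimal sketch, using a hypothetical substring "u.s.":

    import re
    
    substring = "u.s."  # hypothetical example containing regex metacharacters
    prefixes = (re.escape(substring),) + nlp.Defaults.prefixes
    # do the same for suffixes and infixes before recompiling
    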

    (In v2.1 and earlier, the same customization works on a freshly loaded nlp, but if you have already processed texts with the pipeline and then modify the settings, the bug causes the tokenizer to reuse cached tokenizations rather than apply the new settings.)
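
    If you are stuck on v2.1 or earlier, one workaround is to replace the tokenizer wholesale so that no stale cache entries are reused. This is an untested sketch, assuming spaCy v2.x and the prefix_regex, suffix_regex, and infix_regex objects compiled above:

    from spacy.tokenizer import Tokenizer
    
    # A brand-new Tokenizer starts with an empty cache, sidestepping the bug.
    nlp.tokenizer = Tokenizer(
        nlp.vocab,
        rules=nlp.Defaults.tokenizer_exceptions,
        prefix_search=prefix_regex.search,
        suffix_search=suffix_regex.search,
        infix_finditer=infix_regex.finditer,
        token_match=nlp.Defaults.token_match,
    )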