Tags: python, tokenize, spacy

Add some custom words to the tokenizer in spaCy


I have a sentence and would like the tokenizer to produce the expected tokens shown below.

Sentence: "[x] works for [y] in [z]."
Current tokens: ["[", "x", "]", "works", "for", "[", "y", "]", "in", "[", "z", "]", "."]
Expected: ["[x]", "works", "for", "[y]", "in", "[z]", "."]

How can I achieve this with a custom tokenizer?
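
For reference, this is a minimal reproduction of the current behavior with the stock pipeline (assuming en_core_web_sm is installed):

    import spacy

    # Default tokenizer: the brackets are split off as separate tokens.
    nlp = spacy.load('en_core_web_sm')
    print([t.text for t in nlp("[x] works for [y] in [z].")])
    # ['[', 'x', ']', 'works', 'for', '[', 'y', ']', 'in', '[', 'z', ']', '.']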


Solution

  • You can remove [ and ] from the tokenizer prefixes and suffixes so that the brackets are not split off from adjacent tokens:

    import spacy
    nlp = spacy.load('en_core_web_sm')
    
    # Remove the escaped '[' from the default prefix rules and rebuild
    # the prefix regex so '[' is no longer split off at the start of a token.
    prefixes = list(nlp.Defaults.prefixes)
    prefixes.remove('\\[')
    prefix_regex = spacy.util.compile_prefix_regex(prefixes)
    nlp.tokenizer.prefix_search = prefix_regex.search
    
    # Do the same for the escaped ']' in the suffix rules so ']' is no
    # longer split off at the end of a token.
    suffixes = list(nlp.Defaults.suffixes)
    suffixes.remove('\\]')
    suffix_regex = spacy.util.compile_suffix_regex(suffixes)
    nlp.tokenizer.suffix_search = suffix_regex.search
    
    doc = nlp("[x] works for [y] in [z].")
    print([t.text for t in doc])
    # ['[x]', 'works', 'for', '[y]', 'in', '[z]', '.']
    

    The relevant documentation is here:

    https://spacy.io/usage/linguistic-features#native-tokenizer-additions
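
    As a quick sanity check (using the same modified pipeline as above), the change only affects square brackets; parentheses and other punctuation are still split off by the remaining default rules:

    doc = nlp("(x) works for [y].")
    print([t.text for t in doc])
    # ['(', 'x', ')', 'works', 'for', '[y]', '.']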