python · nlp · tokenize · spacy

With spacy, how to make sure a sequence of letters is never split into tokens


I'm looking for a way to make sure that any time the sequence "#*" appears in the text, spaCy gives me the single token "#*". I've tried every way I could think of to add special cases with add_special_case and to build a custom Tokenizer using prefix_search, suffix_search, infix_finditer and token_match, but there are still cases where a "#*" in a sentence, even surrounded by perfectly ordinary tokens (tokens that should be recognized without a problem), gets split into [#, *]. What can I do?

Thanks.


Solution

  • spaCy's current handling of special cases that contain characters which are otherwise prefixes or suffixes isn't ideal, and it isn't quite what you'd expect in all cases.

    This would be a bit easier to answer with examples of what the text looks like and where the tokenization isn't working, but:

    If #* is always surrounded by whitespace, a special case should work:

    import spacy
    nlp = spacy.blank("en")  # or any loaded pipeline
    nlp.tokenizer.add_special_case("#*", [{"ORTH": "#*"}])
    print([t.text for t in nlp("a #* a")])  # ['a', '#*', 'a']
    

    If #* should be tokenized as if it were a word like to, one option is to remove # and * from the prefixes and suffixes; then those characters aren't treated any differently from t or o. Adjacent punctuation would still be split off as affixes, while adjacent letters and numbers wouldn't be. A sketch of this option follows below.
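
    As a minimal sketch of that option (assuming spaCy v2.x, where the default affix patterns are tuples of regex strings; the exact entries, e.g. whether * is stored escaped as \*, can vary by version, so inspect nlp.Defaults.prefixes if the filter below doesn't remove anything):

    import spacy

    nlp = spacy.blank("en")
    # Drop '#' and the escaped '\*' from the default prefix and suffix patterns
    # so that these characters behave like ordinary word characters.
    prefixes = [p for p in nlp.Defaults.prefixes if p not in ("#", r"\*")]
    suffixes = [s for s in nlp.Defaults.suffixes if s not in ("#", r"\*")]
    nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
    nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
    print([t.text for t in nlp("a #* a.")])  # expected: ['a', '#*', 'a', '.']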

    If #* is potentially adjacent to any other characters, as in #*a, a#*a, or "#*", it's probably easiest to add it as a prefix, suffix, and infix, placing it before the default patterns so that defaults like the single-character # prefix aren't matched first:

    prefixes = ("#\*",) + nlp.Defaults.prefixes
    nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
    suffixes = ("#\*",) + nlp.Defaults.suffixes
    nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
    infixes = ("#\*",) + nlp.Defaults.infixes + ("#\*",)
    nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
    
    print([t.text for t in nlp("a#* a#*a #*a '#*'")])
    # ['a', '#*', 'a', '#*', 'a', '#*', 'a', "'", '#*', "'"]
    

    This is a good case for using the new debugging function that was just added to the tokenizer (disclaimer: I am the author). With spaCy v2.2.3, try:

    nlp.tokenizer.explain('#*')
    

    The output [('PREFIX', '#'), ('SUFFIX', '*')] tells you which patterns are responsible for the resulting tokenization. As you modify the patterns, this function should let you see more easily whether your modifications are working as intended.

    After the modifications in the final example above, the output is:

    nlp.tokenizer.explain("a#* a#*a #*a '#*'")
    # [('TOKEN', 'a'), ('SUFFIX', '#*'), ('TOKEN', 'a'), ('INFIX', '#*'), ('TOKEN', 'a'), ('PREFIX', '#*'), ('TOKEN', 'a'), ('PREFIX', "'"), ('PREFIX', '#*'), ('SUFFIX', "'")]