
spaCy - modify tokenizer for numeric patterns


I have seen some ways to create a custom tokenizer, but I am a little confused. I am using the PhraseMatcher to match patterns. However, it also matches a 4-digit number pattern, say 1234, inside 111-111-1234, because the tokenizer splits on the dashes.

All I want to do is modify the current tokenizer (from nlp = English()) and add a rule so that it does not split on certain characters, but only within numeric patterns.
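
For reference, here is a minimal sketch of the unwanted match; the FOUR_DIGITS label and the example number are stand-ins for whatever patterns you are actually matching (assuming spaCy v3's matcher.add signature):

    from spacy.lang.en import English
    from spacy.matcher import PhraseMatcher

    nlp = English()  # rule-based tokenizer only, as in the question
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("FOUR_DIGITS", [nlp.make_doc("1234")])

    # the dash splits make "1234" a token of its own, so the matcher
    # fires inside the phone number
    doc = nlp("call 111-111-1234 now")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # ['1234']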


Solution

  • To do this you will need to overwrite spaCy's default infix tokenization scheme with your own, by modifying the default infix patterns, which are defined in spacy.lang.punctuation.
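
    For reference, you can print the default infix patterns directly (a minimal sketch):

    from spacy.lang.punctuation import TOKENIZER_INFIXES

    for pattern in TOKENIZER_INFIXES:
        print(pattern)

    The snippet below rebuilds that list, dropping the hyphen from the numeric pattern so that hyphenated numbers stay together: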

    import spacy
    from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
    from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
    from spacy.util import compile_infix_regex
    
    # default tokenizer: splits the phone number on the hyphens
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("111-222-1234 for abcDE")
    print([t.text for t in doc])
    
    # modify tokenizer infix patterns
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            # Default is r"(?<=[0-9])[+\-\*^](?=[0-9-])"; the hyphen is
            # removed here so numbers like 111-222-1234 are no longer split
            r"(?<=[0-9])[+\*^](?=[0-9-])",
            # Default has a mandatory dot; making it optional also splits
            # lower-to-upper case transitions such as "abcDE"
            r"(?<=[{al}{q}])\.?(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )
    
    infix_re = compile_infix_regex(infixes)
    nlp.tokenizer.infix_finditer = infix_re.finditer
    doc = nlp("111-222-1234 for abcDE")
    print([t.text for t in doc])
    

    Output

    With default tokenizer:
    ['111', '-', '222', '-', '1234', 'for', 'abcDE']
    
    With custom tokenizer:
    ['111-222-1234', 'for', 'abc', 'DE']
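
    With the custom infixes installed, a PhraseMatcher pattern for "1234" no longer fires inside the phone number. A minimal sketch, reusing the patched nlp from above; the FOUR_DIGITS pattern is carried over from the question (assuming spaCy v3's matcher.add signature):

    from spacy.matcher import PhraseMatcher

    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("FOUR_DIGITS", [nlp.make_doc("1234")])

    doc = nlp("111-222-1234 for abcDE")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # [] -- the phone number is now a single token, so no false match

    doc = nlp("use code 1234 for abcDE")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # ['1234'] -- standalone numbers still match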