
spaCy - modify tokenizer for numeric patterns


I have seen some ways to create a custom tokenizer, but I am a little confused. I am using the PhraseMatcher to match patterns. However, it also matches a 4-digit number pattern, say 1234, inside 111-111-1234, because the tokenizer splits on the dashes.

All I want to do is modify the current tokenizer (from nlp = English()) and add a rule so that it does not split on certain characters, but only within numeric patterns.
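
For reference, here is a minimal sketch of the unwanted match; the FOUR_DIGITS label and the example number are stand-ins for whatever patterns you are actually matching (assuming spaCy v3's matcher.add signature):

    from spacy.lang.en import English
    from spacy.matcher import PhraseMatcher

    nlp = English()  # rule-based tokenizer only, as in the question
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("FOUR_DIGITS", [nlp.make_doc("1234")])

    # the dash splits make "1234" a token of its own, so the matcher
    # fires inside the phone number
    doc = nlp("call 111-111-1234 now")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # ['1234']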


Solution

  • To do this you will need to overwrite spaCy's default infix tokenization scheme with your own, by modifying the default infix patterns, which are defined in spacy.lang.punctuation.
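
    For reference, you can print the default infix patterns directly (a minimal sketch):

    from spacy.lang.punctuation import TOKENIZER_INFIXES

    for pattern in TOKENIZER_INFIXES:
        print(pattern)

    The snippet below rebuilds that list, dropping the hyphen from the numeric pattern so that hyphenated numbers stay together: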

    import spacy
    from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
    from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
    from spacy.util import compile_infix_regex
    
    # default tokenizer: splits the phone number on the hyphens
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("111-222-1234 for abcDE")
    print([t.text for t in doc])
    
    # modify tokenizer infix patterns
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            # Default is r"(?<=[0-9])[+\-\*^](?=[0-9-])"; the hyphen is
            # removed here so numbers like 111-222-1234 are no longer split
            r"(?<=[0-9])[+\*^](?=[0-9-])",
            # Default has a mandatory dot; making it optional also splits
            # lower-to-upper case transitions such as "abcDE"
            r"(?<=[{al}{q}])\.?(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )
    
    infix_re = compile_infix_regex(infixes)
    nlp.tokenizer.infix_finditer = infix_re.finditer
    doc = nlp("111-222-1234 for abcDE")
    print([t.text for t in doc])
    

    Output

    With default tokenizer:
    ['111', '-', '222', '-', '1234', 'for', 'abcDE']
    
    With custom tokenizer:
    ['111-222-1234', 'for', 'abc', 'DE']
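
    With the custom infixes installed, a PhraseMatcher pattern for "1234" no longer fires inside the phone number. A minimal sketch, reusing the patched nlp from above; the FOUR_DIGITS pattern is carried over from the question (assuming spaCy v3's matcher.add signature):

    from spacy.matcher import PhraseMatcher

    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("FOUR_DIGITS", [nlp.make_doc("1234")])

    doc = nlp("111-222-1234 for abcDE")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # [] -- the phone number is now a single token, so no false match

    doc = nlp("use code 1234 for abcDE")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # ['1234'] -- standalone numbers still match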