Tags: python, nlp, tokenize, spacy

Prevent spaCy tokenizer from splitting on a specific character


When using spaCy to tokenize a sentence, I want it not to split on the / character.

Example:

import en_core_web_lg
nlp = en_core_web_lg.load()
for i in nlp("Get 10ct/liter off when using our App"):
    print(i)

Output:

Get
10ct
/
liter
off
when
using
our
App

I want the tokens to be Get, 10ct/liter, off, when, ...

I found out how to add more split rules to spaCy's tokenizer, but not how to disable a specific one.


Solution

  • I suggest using a custom tokenizer; see "Modifying existing rule sets" in the spaCy documentation:

    import spacy
    from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
    from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
    from spacy.util import compile_infix_regex
    
    nlp = spacy.load("en_core_web_trf")
    text = "Get 10ct/liter off when using our App"
    # Modify tokenizer infix patterns
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
        ]
    )
    
    infix_re = compile_infix_regex(infixes)
    nlp.tokenizer.infix_finditer = infix_re.finditer
    doc = nlp(text)
    print([t.text for t in doc])
    ## =>  ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']
    

    Note the commented-out #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), line: I simply removed the / character from the [:<>=/] character class. That rule splits at a / that sits between a letter or digit and a letter.

    If you still need to split '12/ct' into three tokens, add another line below the r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA) line:

    r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),
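
To see how the three rule variants behave without loading a model, here is a minimal sketch that uses only Python's re module. It is not spaCy's tokenizer: the split_on_infixes helper is hypothetical, and ALPHA is approximated with ASCII letters (spaCy's real ALPHA class covers far more characters), so treat it as an illustration of the lookbehind/lookahead logic only.

```python
import re

# Assumption: ASCII letters stand in for spaCy's much broader ALPHA class.
ALPHA = "a-zA-Z"

# The default rule, the rule with "/" removed, and the extra digit-only rule.
original_rule = r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA)
modified_rule = r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA)
digit_slash_rule = r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA)


def split_on_infixes(text, rules):
    """Hypothetical helper: split text at every infix match,
    keeping each matched infix as its own piece."""
    pattern = re.compile("|".join(rules))
    pieces = []
    start = 0
    for match in pattern.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


for word in ("10ct/liter", "12/ct"):
    print(word)
    print("  original rule:      ", split_on_infixes(word, [original_rule]))
    print("  '/' removed:        ", split_on_infixes(word, [modified_rule]))
    print("  plus digit-'/' rule:", split_on_infixes(word, [modified_rule, digit_slash_rule]))
```

With the digit-only rule added, '12/ct' is split again because the / follows a digit, while '10ct/liter' stays whole because its / follows a letter.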