
spaCy: custom infix regex rule to split on `:` for patterns like mailto:johndoe@gmail.com is not applied consistently


With the default tokenizer, spaCy treats mailto:johndoe@gmail.com as one single token.

I tried the following:

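import spacy

nlp = spacy.load('en_core_web_lg')
# Add an infix rule that splits on the colon directly after "mailto".
# tuple(...) keeps this working whether the defaults are a tuple or a list.
infixes = tuple(nlp.Defaults.infixes) + (r'(?<=mailto):(?=\w+)', )
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer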

However, this custom rule is not applied consistently. For example, if I apply the tokenizer to mailto:johndoe@gmail.c, it does what I want:

nlp("mailto:johndoe@gmail.c")
# [mailto, :, johndoe@gmail.c]

However, if I apply the tokenizer to mailto:johndoe@gmail.com, it does not work as intended.

nlp("mailto:johndoe@gmail.com")
# [mailto:johndoe@gmail.com]

Is there a way to fix this inconsistency?


Solution

  • There's a tokenizer exception pattern for URLs, which matches things like mailto:johndoe@gmail.com as one token. The pattern requires a top-level domain of at least two letters, so it matches gmail.co and gmail.com but not gmail.c. That is why your infix rule only takes effect on the last of these.

    You can override it by setting:

    nlp.tokenizer.token_match = None
    

    Then you should get:

    [t.text for t in nlp("mailto:johndoe@gmail.com")]
    # ['mailto', ':', 'johndoe@gmail.com']
    
    [t.text for t in nlp("mailto:johndoe@gmail.c")]
    # ['mailto', ':', 'johndoe@gmail.c']
    

    If you want to keep the default URL tokenization for everything except mailto:, you could modify URL_PATTERN from lang/tokenizer_exceptions.py (see also how TOKEN_MATCH is defined right below it) and assign the resulting matcher instead of None.