spaCy: custom infix regex rule to split on `:` for patterns like mailto:[email protected] is not applied consistently


With the default tokenizer, spaCy treats mailto:[email protected] as one single token.

I tried the following:

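import spacy

nlp = spacy.load('en_core_web_lg')
# Add an infix rule that splits on the colon directly after "mailto".
# list() keeps this working whether Defaults.infixes is a tuple or a list.
infixes = list(nlp.Defaults.infixes) + [r'(?<=mailto):(?=\w+)']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer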

However, the custom rule above is not applied in a consistent manner. For example, if I apply the tokenizer to mailto:[email protected], it does what I want:

nlp("mailto:[email protected]")
# [mailto, :, [email protected]]

However, if I apply the tokenizer to mailto:[email protected], it does not work as intended.

nlp("mailto:[email protected]")
# [mailto:[email protected]]

I wonder if there is a way to fix this inconsistency?


Solution

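  • There's a tokenizer exception pattern for URLs, which matches things like mailto:[email protected] as one token. This match is checked before the infix rules are applied, so a string that matches it is never split. The pattern requires a top-level domain of at least two letters, so it matches gmail.co and gmail.com but not gmail.c, which is why the result depends on the address.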

    You can override it by setting:

    nlp.tokenizer.token_match = None
    

    Then you should get:

    [t.text for t in nlp("mailto:[email protected]")]
    # ['mailto', ':', '[email protected]']
    
    [t.text for t in nlp("mailto:[email protected]")]
    # ['mailto', ':', '[email protected]']
    

    If you want to keep the default URL tokenization for everything except mailto:, you could modify URL_PATTERN from lang/tokenizer_exceptions.py (also see how TOKEN_MATCH is defined right below it) and use the result instead of None.