With the default tokenizer, spaCy treats mailto:[email protected] as a single token.
I tried the following:
import spacy

nlp = spacy.load('en_core_web_lg')
# Add an infix rule that splits on the colon right after "mailto"
infixes = list(nlp.Defaults.infixes) + [r'(?<=mailto):(?=\w+)']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
However, this custom rule does not behave consistently. For example, if I apply the tokenizer to mailto:[email protected], it does what I want:
nlp("mailto:[email protected]")
# [mailto, :, [email protected]]
However, if I apply the tokenizer to mailto:[email protected], it does not work as intended:
nlp("mailto:[email protected]")
# [mailto:[email protected]]
Is there a way to fix this inconsistency?
There's a tokenizer exception pattern for URLs, which matches things like mailto:[email protected] as one token. It knows that top-level domains have at least two letters, so it matches gmail.co and gmail.com but not gmail.c.
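To illustrate why the two-letter rule produces the inconsistency, here is a minimal sketch using a deliberately simplified regex. It is not spaCy's actual URL_PATTERN, and the addresses are hypothetical; the only point is that a one-letter top-level domain fails the whole-string match, so the string falls through to the prefix, infix, and suffix rules, where the custom mailto infix can apply.

```python
import re

# Simplified stand-in for a URL-style pattern (an assumption for illustration,
# NOT spaCy's real URL_PATTERN). The top-level domain needs at least two letters.
SIMPLE_URL = re.compile(
    r"^(?:mailto:)?[\w.+-]+@(?:[\w-]+\.)+[a-z]{2,}$", re.IGNORECASE
)

def is_single_token(text):
    """Return True if the whole string matches the simplified pattern."""
    return SIMPLE_URL.match(text) is not None

# Hypothetical example addresses
examples = [
    "mailto:someone@gmail.com",
    "mailto:someone@gmail.co",
    "mailto:someone@gmail.c",
]
results = {text: is_single_token(text) for text in examples}

for text, matched in results.items():
    outcome = "kept whole" if matched else "left to prefix/infix/suffix rules"
    print(text, "->", outcome)
```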
You can override it by setting:
nlp.tokenizer.token_match = None
Then you should get:
[t.text for t in nlp("mailto:[email protected]")]
# ['mailto', ':', '[email protected]']
[t.text for t in nlp("mailto:[email protected]")]
# ['mailto', ':', '[email protected]']
If you want URL tokenization to stay as it is by default except for mailto:, you could modify URL_PATTERN from lang/tokenizer_exceptions.py (also see how TOKEN_MATCH is defined right below it) and use that rather than None.
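As an alternative to editing the pattern itself, one possible approach is to wrap the existing matcher so that anything starting with mailto: is never treated as a match. This is only a sketch: skip_mailto is a hypothetical helper, and demo_default is a much simpler stand-in for the default matcher so that the example runs without spaCy installed.

```python
import re

def skip_mailto(default_match):
    """Wrap a token_match-style callable so 'mailto:' strings never match."""
    def token_match(text):
        if text.startswith("mailto:"):
            # No match: the string goes on to prefix/infix/suffix splitting
            return None
        if default_match is None:
            return None
        return default_match(text)
    return token_match

# Hypothetical stand-in for the default matcher, far simpler than spaCy's
demo_default = re.compile(
    r"^(?:https?://|mailto:)\S+\.[a-z]{2,}$", re.IGNORECASE
).match

demo_match = skip_mailto(demo_default)

print(bool(demo_default("mailto:someone@gmail.com")))  # matched by the stand-in
print(bool(demo_match("mailto:someone@gmail.com")))    # skipped by the wrapper
print(bool(demo_match("https://example.com")))         # other URLs still match
```

With a real pipeline this would presumably be applied as nlp.tokenizer.token_match = skip_mailto(nlp.tokenizer.token_match). That assumes a spaCy version where the URL pattern is attached through token_match, as described above; later releases appear to expose it through a separate url_match setting, in which case the same wrapper would go there instead.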