python · nlp · spacy

How to modify spacy tokenizer to split URLs into individual words


I want to modify the default tokenizer to split URLs into individual words. Here's what I currently have:


import spacy
nlp = spacy.blank('en')
infixes = nlp.Defaults.infixes + [r'\.']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
print(list(nlp('www.internet.com'))) 
# ['www.internet.com']
# want it to be ['www', '.', 'internet', '.', 'com']

I'm looking at the usage examples and the source code for the tokenizer, but I can't work out how to handle this particular case.


Solution

  • You're not seeing the results you want because the URL gets caught by the URL_MATCH rule first (it has higher precedence than the infix patterns):

    import spacy
    nlp = spacy.blank('en')
    txt = 'Check this out www.internet.com'
    doc = nlp(txt)
    nlp.tokenizer.explain(txt)
    

    [('TOKEN', 'Check'),
     ('TOKEN', 'this'),
     ('TOKEN', 'out'),
     ('URL_MATCH', 'www.internet.com')]
    

    One possible solution is to disable the URL matcher and then add the infix rule:

    nlp.tokenizer.url_match = None
    infixes = nlp.Defaults.infixes + [r'\.']
    infix_regex = spacy.util.compile_infix_regex(infixes)
    nlp.tokenizer.infix_finditer = infix_regex.finditer
    doc = nlp(txt)
    list(doc)
    

    [Check, this, out, www, ., internet, ., com]
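
    To confirm where the new tokens come from, nlp.tokenizer.explain can be run again. The sketch below assumes the same blank English pipeline as above; the exact rule labels in the output may differ slightly between spaCy versions.

    import spacy

    nlp = spacy.blank('en')
    nlp.tokenizer.url_match = None          # turn off the URL matcher
    infixes = nlp.Defaults.infixes + [r'\.']
    infix_regex = spacy.util.compile_infix_regex(infixes)
    nlp.tokenizer.infix_finditer = infix_regex.finditer

    print(nlp.tokenizer.explain('www.internet.com'))
    # Roughly: [('TOKEN', 'www'), ('INFIX', '.'), ('TOKEN', 'internet'),
    #           ('INFIX', '.'), ('TOKEN', 'com')]

    Keep in mind that setting url_match to None disables URL detection for every URL, so something like https://internet.com/page will also be split at the added infix. If that matters, an alternative would be to replace url_match with a narrower pattern instead of removing it entirely.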