
Force spaCy not to parse punctuation?


Is there a way to force spaCy not to tokenize punctuation as separate tokens?

    nlp = spacy.load('en')
    doc = nlp(u'the $O is in $R')
    print([w for w in doc])
    # [the, $, O, is, in, $, R]

I want:

    # [the, $O, is, in, $R]

Solution

  • Customize the prefix_search function of spaCy's Tokenizer class (refer to the spaCy tokenizer documentation). Something like:

    import re
    import spacy
    from spacy.tokenizer import Tokenizer
    
    # Match '$' followed by an alphanumeric character as a single prefix,
    # so '$O' stays whole instead of '$' being split off as its own token.
    # Adjust the regex to whatever currency pattern you need.
    prefix_re = re.compile(r'''^\$[a-zA-Z0-9]''')
    
    def custom_tokenizer(nlp):
        return Tokenizer(nlp.vocab, prefix_search=prefix_re.search)
    
    nlp = spacy.load("en_core_web_sm")
    nlp.tokenizer = custom_tokenizer(nlp)
    doc = nlp(u'the $O is in $R')
    print([t.text for t in doc])
    
    # ['the', '$O', 'is', 'in', '$R']
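
Note that constructing a bare Tokenizer this way drops spaCy's default suffix, infix, and exception rules, so other punctuation (e.g. trailing commas) will no longer be split off either. If you want to keep the defaults and only stop '$' from being treated as a prefix, a minimal sketch along these lines should work, assuming the documented nlp.Defaults.prefixes / spacy.util.compile_prefix_regex API:

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    
    # Keep every default prefix rule except those containing '$',
    # rebuild the prefix regex, and patch it onto the existing tokenizer.
    # Suffix/infix splitting and tokenizer exceptions remain untouched.
    prefixes = [p for p in nlp.Defaults.prefixes if '$' not in p]
    prefix_re = spacy.util.compile_prefix_regex(prefixes)
    nlp.tokenizer.prefix_search = prefix_re.search
    
    doc = nlp(u'the $O is in $R')
    print([t.text for t in doc])
    # Expected: ['the', '$O', 'is', 'in', '$R']

This keeps the rest of the tokenizer's behavior intact, which is usually what you want if your text contains ordinary punctuation alongside these '$'-prefixed symbols.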