Is there a way to force spaCy not to parse punctuation as separate tokens?
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'the $O is in $R')
[ w for w in doc ]
: [the, $, O, is, in, $, R]
I want :
: [the, $O, is, in, $R]
Customize the prefix_search function for spaCy's Tokenizer class. See the Tokenizer documentation. Something like:
import spacy
import re
from spacy.tokenizer import Tokenizer
# adjust the regex to whatever currency pattern you need;
# this matches "$" followed by an alphanumeric character
prefix_re = re.compile(r'^\$[a-zA-Z0-9]')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'the $O is in $R')
print([t.text for t in doc])
# ['the', '$O', 'is', 'in', '$R']
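One caveat: constructing a `Tokenizer` with only `prefix_search` discards the default suffix and infix rules, so trailing punctuation (e.g. `"$R5.00,"`) would no longer be split off. A minimal sketch of keeping those defaults, assuming a spaCy version where the existing tokenizer exposes its `suffix_search`, `infix_finditer`, and `token_match` callables (it uses `spacy.blank("en")` so no model download is required):

```python
import re

import spacy
from spacy.tokenizer import Tokenizer

# blank English pipeline: tokenizer only, no trained model needed
nlp = spacy.blank("en")

# treat "$" followed by an alphanumeric character as a single prefix,
# so "$O" is consumed whole instead of being split into "$" + "O"
prefix_re = re.compile(r"^\$[a-zA-Z0-9]")

# reuse the default tokenizer's suffix/infix/token-match behaviour
# instead of dropping it entirely
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)

print([t.text for t in nlp("the $O is in $R")])
```

Because the suffix rules are preserved, ordinary punctuation handling still works alongside the custom `$`-prefix behaviour.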