I am using spaCy v2.
I am looking for dates in a doc, and I want the tokenizer to merge each date into a single token.
For example:
doc = 'Customer: Johnna 26 06 1989'
The default tokenizer's output looks like this:
('Customer:', 'customer:', 'NUM', 'CD', 'amod', 'Xxxxx:', False, False)
('Johnna', 'Johnna ', 'PROPN', 'NNP', 'ROOT', 'xxxx', True, False)
('26', '26', 'NUM', 'CD', 'compound', 'dd', False, False)
('06', '06', 'NUM', 'CD', 'appos', 'dd', False, False)
('1989', '1989', 'NUM', 'CD', 'nummod', 'dddd', False, False)
But I want it to look like this:
('Customer:', 'customer:', 'NUM', 'CD', 'amod', 'Xxxxx:', False, False)
('Johnna', 'Johnna ', 'PROPN', 'NNP', 'ROOT', 'xxxx', True, False)
('26 06 1989', '26', 'NUM', 'CD', 'compound', 'dd dd dd', False, False)
I tried to create a custom tokenizer, but I am not sure whether I need to change the prefix or the suffix patterns, or how to define this case.
def __customize_tokenizer(self):
    prefix_re = re.compile(r'\d+\s+\d+')
    return Tokenizer(self._nlp.vocab, prefix_search=prefix_re.search)
The tokenizer algorithm doesn't support this kind of pattern: it doesn't support regexes in its exceptions and the affix patterns aren't applied across whitespace.
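To see why the prefix pattern from the question never fires, here is a rough pure-Python approximation. It assumes only that the text is split on whitespace before the affix patterns are applied to each chunk; it is an illustration of that one point, not spaCy's actual tokenizer code.

```python
import re

# The prefix pattern from the question: digits, whitespace, digits.
prefix_re = re.compile(r"\d+\s+\d+")
text = "Customer: Johnna 26 06 1989"

# Assumption for illustration: affix patterns only ever see one
# whitespace-delimited chunk at a time, never the full text.
chunks = text.split()
per_chunk = [prefix_re.search(chunk) for chunk in chunks]

# Against the full text the pattern does match, but the tokenizer
# never applies it that way.
whole_text = prefix_re.search(text)

print(per_chunk)           # every entry is None
print(whole_text.group())  # 26 06
```

No single chunk contains whitespace, so a pattern that requires `\s+` cannot match any of them.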
Instead, one option is to find these cases with the Matcher, which does support regexes, and use the retokenizer to merge the tokens:
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# REGEX is applied as a search, so anchor each pattern to match the whole token
matcher.add("DATE", [[{"ORTH": {"REGEX": r"^\d\d$"}}, {"ORTH": {"REGEX": r"^\d\d$"}}, {"ORTH": {"REGEX": r"^\d\d\d\d$"}}]])
text = "This is a date 01 02 2000 in a sentence."
doc = nlp(text)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matcher(doc):
        # merge the matched span into a single token
        retokenizer.merge(doc[start:end])
print([t.text for t in doc])
# ['This', 'is', 'a', 'date', '01 02 2000', 'in', 'a', 'sentence', '.']
If you want, you can put the matching and retokenization into a custom component at the beginning of your pipeline, see: https://v2.spacy.io/usage/processing-pipelines#custom-components
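As a minimal sketch of such a component: the function name `merge_dates` is made up for illustration, and the list-of-patterns form of `matcher.add` assumes spaCy v2.2.2 or later. The `else` branch is only there so the snippet also runs on spaCy v3, where components have to be registered by name; on v2 only the first branch is used.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Anchored so that each pattern has to match the whole token.
matcher.add("DATE", [[
    {"ORTH": {"REGEX": r"^\d\d$"}},
    {"ORTH": {"REGEX": r"^\d\d$"}},
    {"ORTH": {"REGEX": r"^\d\d\d\d$"}},
]])


def merge_dates(doc):
    """Hypothetical component: merge every 'dd dd dddd' match into one token."""
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matcher(doc):
            retokenizer.merge(doc[start:end])
    return doc


if spacy.__version__.startswith("2."):
    # spaCy v2: add the callable directly, before all other components.
    nlp.add_pipe(merge_dates, first=True)
else:
    # spaCy v3 (not the version in the question): register by name first.
    from spacy.language import Language
    Language.component("merge_dates", func=merge_dates)
    nlp.add_pipe("merge_dates", first=True)

doc = nlp("Customer: Johnna 26 06 1989")
print([t.text for t in doc])
```

Running the component first means any tagger or parser added later would see the date as a single token.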