Hi, I am trying to use spaCy to match measurement phrases in texts like
1 cups, 1 1/2 cups, 1 1/2-inch
To achieve this, I created the matcher patterns below.
pattern1 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'POS': 'NOUN'}]
# number, optional second number, noun
pattern2 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'ORTH': '-', 'OP': '?'}, {'POS': 'NOUN'}]
# the second number is optional, to cover both '2 inch' and '2 1/2 inch'
# it should also cover '2 1/2-inch', so 'ORTH': '-' is added as an optional token
However, when I run the matcher, it only returns matches of one kind: a single number followed by a noun, as shown below.
matcher.add('Measurepattern', None, pattern1)
matcher.add('Measurepattern', None, pattern2)
matches = matcher(test_token)
matches
for match_id, start, end in matches:
    print(test_token[start:end])
# 2 teaspoons
# 1 teaspoon
# 1 cup
Why is that, and how do I fix it?
Thank you
In spaCy 2.3.2, 1 1/2-inch is tokenized as ('1', 'NUM'), ('1/2-inch', 'NUM'), so there is no separate NOUN token for your current patterns to match. You need to add a new, specific pattern for this case.
Here is an example: pattern3 = [{'POS': 'NUM'}, {"TEXT": {"REGEX": r"^\d+(?:/\d+)?-\w+$"}}]. The regex matches a token whose text starts with one or more digits, optionally followed by / and one or more digits, then a -, and then one or more word characters (letters, digits or _). You may replace \w with [^\W\d_] to match only letters.
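To see what the regex accepts on its own, here is a small sketch that uses only Python's standard `re` module. The sample strings are made up for illustration, and the sketch assumes the pattern is applied to each token's text with search semantics; since the regex is anchored with ^ and $, it has to cover the whole token either way.

```python
import re

# Same regex as in pattern3: digits, optional "/digits", a hyphen, word chars.
TOKEN_RE = re.compile(r"^\d+(?:/\d+)?-\w+$")
# Variant with \w replaced by [^\W\d_], so only letters may follow the hyphen.
LETTERS_ONLY_RE = re.compile(r"^\d+(?:/\d+)?-[^\W\d_]+$")

# Hypothetical token texts, chosen to exercise each part of the regex.
samples = ["1/2-inch", "2-inch", "1/2", "inch", "1/2-", "3-4"]

matched = [s for s in samples if TOKEN_RE.search(s)]
letters_only = [s for s in samples if LETTERS_ONLY_RE.search(s)]

print(matched)       # ['1/2-inch', '2-inch', '3-4']
print(letters_only)  # ['1/2-inch', '2-inch']
```

Note that the \w version also accepts "3-4", because digits count as word characters; the letters-only variant rejects it.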
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# number, optional second number, noun
pattern1 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'POS': 'NOUN'}]
# same, with an optional hyphen token before the noun
pattern2 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {"ORTH": "-", 'OP': '?'}, {'POS': 'NOUN'}]
# number followed by a single token such as '1/2-inch' (raw string keeps the backslashes intact)
pattern3 = [{'POS': 'NUM'}, {"TEXT": {"REGEX": r"^\d+(?:/\d+)?-\w+$"}}]
matcher.add("HelloWorld", [pattern1, pattern2, pattern3])

doc = nlp("1 cups, 1 1/2 cups, 1 1/2-inch")
print([(t.text, t.pos_) for t in doc])
# [('1', 'NUM'), ('cups', 'NOUN'), (',', 'PUNCT'), ('1', 'NUM'), ('1/2', 'NUM'), ('cups', 'NOUN'), (',', 'PUNCT'), ('1', 'NUM'), ('1/2-inch', 'NUM')]

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
# filter_spans drops overlapping matches, keeping the longest ones
print(spacy.util.filter_spans(spans))
# => [1 cups, 1 1/2 cups, 1 1/2-inch]
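The matcher can return overlapping matches (for example both 1 1/2 cups and 1/2 cups, since the second number in pattern1 is optional), which is why the result is passed through spacy.util.filter_spans. The following is a rough pure-Python sketch of that idea, not spaCy's actual code: the helper name keep_longest and the hand-written (start, end) token offsets are assumptions made for illustration.

```python
def keep_longest(spans):
    """Keep the longest non-overlapping (start, end) spans.

    Longer spans win; among equally long spans, the earlier start wins.
    """
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end in ordered:
        # Skip any span that shares a token index with one already kept.
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)

# Hypothetical raw match offsets for "1 cups, 1 1/2 cups, 1 1/2-inch":
# (3, 6) is "1 1/2 cups" and (4, 6) is the overlapping "1/2 cups".
raw = [(0, 2), (3, 6), (4, 6), (7, 9)]
filtered = keep_longest(raw)
print(filtered)  # [(0, 2), (3, 6), (7, 9)]
```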