python nlp spacy

spaCy Matcher for number-noun / number-number-noun patterns


Hi, I am trying to use spaCy to match measurement phrases in text such as

1 cups, 1 1/2 cups, 1 1/2-inch

To achieve this, I created the matcher patterns below.

pattern1 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'POS': 'NOUN'}]
# number, optional number, noun

pattern2 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'ORTH': '-', 'OP': '?'},
            {'POS': 'NOUN'}]
# the second number is optional, to cover both '2 inch' and '2 1/2 inch'
# the optional 'ORTH': '-' is meant to cover '2 1/2-inch' as well

However, when I run the matcher, it only returns matches of one shape, a number followed by a noun, as shown below.

matcher.add('Measurepattern', None, pattern1)
matcher.add('Measurepattern', None, pattern2)

matches = matcher(test_token)

matches

# each match is a (match_id, start, end) tuple
for match_id, start, end in matches:
    print(test_token[start:end])

# 2 teaspoons
# 1 teaspoon
# 1 cup

Why is that and how do I fix this?

Thank you


Solution

  • In spaCy 2.3.2, 1 1/2-inch is tokenized as ('1', 'NUM'), ('1/2-inch', 'NUM'). The hyphenated part stays a single token tagged NUM, so there is no separate - token and no NOUN for your current patterns to match. You need to add a new, specific pattern for this case.

    Here is one example: pattern3 = [{'POS': 'NUM'}, {"TEXT": {"REGEX": r"^\d+(?:/\d+)?-\w+$"}}]. The regex matches a token whose text starts with one or more digits, optionally followed by / and one or more digits, then a -, then one or more word characters (letters, digits or _). You may replace \w with [^\W\d_] to match only letters. Note the raw string (r"..."), which keeps Python from treating \d and \w as string escapes.

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    
    pattern1 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'POS': 'NOUN'}]
    pattern2 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {"ORTH": "-", 'OP': '?'}, {'POS': 'NOUN'}]
    # raw string so that \d and \w reach the regex engine unchanged
    pattern3 = [{'POS': 'NUM'}, {"TEXT": {"REGEX": r"^\d+(?:/\d+)?-\w+$"}}]
    
    matcher.add("HelloWorld", [pattern1, pattern2, pattern3])
    
    doc = nlp("1 cups, 1 1/2 cups, 1 1/2-inch")
    print([(t.text, t.pos_) for t in doc])
    #[('1', 'NUM'), ('cups', 'NOUN'), (',', 'PUNCT'), ('1', 'NUM'), ('1/2', 'NUM'), ('cups', 'NOUN'), (',', 'PUNCT'), ('1', 'NUM'), ('1/2-inch', 'NUM')]
    
    matches = matcher(doc)
    spans = [doc[start:end] for _, start, end in matches]
    print(spacy.util.filter_spans(spans))
    # => [1 cups, 1 1/2 cups, 1 1/2-inch]
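
To see what the token regex accepts without loading a spaCy model, it can be tried on plain strings with Python's re module. This is only a sketch: the sample strings are made up for illustration, and the names TOKEN_RE, LETTERS_ONLY_RE and classify are hypothetical helpers, not part of spaCy.

```python
import re

# Same regex as in pattern3, written as a raw string.
TOKEN_RE = re.compile(r"^\d+(?:/\d+)?-\w+$")
# Variant with [^\W\d_] instead of \w: only letters may follow the hyphen.
LETTERS_ONLY_RE = re.compile(r"^\d+(?:/\d+)?-[^\W\d_]+$")

# Made-up token texts (assumption: these stand in for spaCy token.text values).
samples = ["1/2-inch", "2-inch", "1/2", "inch", "1/2-9x", "1/2-"]


def classify(text):
    """Return (matches \\w variant, matches letters-only variant)."""
    return bool(TOKEN_RE.search(text)), bool(LETTERS_ONLY_RE.search(text))


results = {s: classify(s) for s in samples}

for text, (any_word, letters_only) in results.items():
    print(f"{text!r}: \\w variant={any_word}, letters-only variant={letters_only}")
```

The two variants only disagree on strings such as '1/2-9x', where a digit follows the hyphen. Because the regex is anchored with ^ and $, it has to cover the whole token text, so a plain '1/2' or 'inch' is rejected by both.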