
spaCy rule-based matching returns undesired extra matches


I was reproducing a spaCy rule-based matching example:

import spacy 
from spacy.matcher import Matcher 

nlp = spacy.load("en_core_web_md")
doc = nlp("Good morning, I'm here. I'll say good evening!!")
pattern = [{"LOWER": "good"},{"LOWER": {"IN": ["morning", "evening"]}},{"IS_PUNCT": True}] 
matcher.add("greetings", [pattern]) # good morning/evening with one pattern with the help of IN as follows
matches = matcher(doc)
for mid, start, end in matches:
    print(start, end, doc[start:end])

which is supposed to match

Good morning  good evening!

But the above code also matches "I" on both occasions:

0 3 Good morning,
3 4 I
7 8 I
10 13 good evening!

I just want to remove the "I" from the matches.

Thank you


Solution

  • When I run your code on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4, with both the en_core_web_sm and en_core_web_trf pipelines), it raises a NameError because matcher is not defined. After defining matcher as an instance of the Matcher class, as described in the spaCy Matcher documentation, I get the following (desired) output with both pipelines:

    0 3 Good morning,
    10 13 good evening!
    

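    The full working code is shown below. If you still see the unexpected results, I'd suggest restarting your IDE and/or computer: the code you posted never defines matcher, so the object you ran it with may have been left over from an earlier session and may still hold other patterns.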

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Good morning, I'm here. I'll say good evening!!")
    matcher = Matcher(nlp.vocab)
    pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}, {"IS_PUNCT": True}]
    matcher.add("greetings", [pattern])  # one pattern covers both "good morning" and "good evening" via IN
    matches = matcher(doc)
    for match_id, start, end in matches:
        print(start, end, doc[start:end])
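
If restarting is inconvenient, one way to check whether a leftover rule is responsible is to print the rule name next to each match. The sketch below is only an illustration under assumptions: the rule name "stale_rule" and its pattern are hypothetical, added here to simulate a pattern registered earlier in the session, and spacy.blank("en") is used instead of a trained pipeline because LOWER and IS_PUNCT only need the tokenizer.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only English pipeline; no trained model download is required
# because the pattern below uses lexical attributes only.
nlp = spacy.blank("en")
doc = nlp("Good morning, I'm here. I'll say good evening!!")

matcher = Matcher(nlp.vocab)
greeting = [
    {"LOWER": "good"},
    {"LOWER": {"IN": ["morning", "evening"]}},
    {"IS_PUNCT": True},
]
matcher.add("greetings", [greeting])

# HYPOTHETICAL: simulate a rule left over from an earlier cell or session.
# This is an assumption about the cause, not something the question confirms.
matcher.add("stale_rule", [[{"LOWER": "i"}]])


def labelled_matches(matcher, doc):
    """Return (rule_name, start, end, text) for every match in doc."""
    return [
        (doc.vocab.strings[match_id], start, end, doc[start:end].text)
        for match_id, start, end in matcher(doc)
    ]


# Each row now shows which rule produced the match.
for row in labelled_matches(matcher, doc):
    print(*row)

# Removing the unwanted rule leaves only the greeting matches.
if "stale_rule" in matcher:
    matcher.remove("stale_rule")
cleaned = labelled_matches(matcher, doc)
```

If a rule name other than "greetings" shows up in your own session, that rule is the source of the extra matches. It can be dropped with matcher.remove, or you can build a fresh Matcher(nlp.vocab) as in the code above.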