Using the following Matcher rule:
{'label': 'R-1',
'pattern': [{'TEXT': 'MyLabel'}, {'TEXT': ':', 'OP': '?'}],
'greedy': 'LONGEST', }
on the text: 'MyLabel: Some Value'
I get two matches: 'MyLabel' and 'MyLabel:'
For me, that was quite surprising - I was expecting a single match on 'MyLabel:'. Adding the new greedy flag didn't make any difference.
SpaCy version 3.7.5
The behavior you're observing with the spaCy Matcher is expected; it is not a bug. In the token pattern {'TEXT': ':', 'OP': '?'}, the OP: '?' operator means that the colon is optional, so the matcher will generate both the shorter and the longer match, as you've seen.
Pattern: {'TEXT': 'MyLabel'}, {'TEXT': ':', 'OP': '?'}
Text: 'MyLabel: Some Value'
So for this pattern, spaCy will try to match:
'MyLabel' alone (because the colon is optional).
'MyLabel:' (because the colon can be included).
Therefore, you will get two matches: 'MyLabel' and 'MyLabel:'.
Is this the intended behavior or is it a bug?
It is the intended behavior: the OP: '?' operator allows the colon to be optionally matched, leading to multiple matches.
How should I determine that the second match really is just a subset of the first match?
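One direct way to answer the subset question is to compare the token offsets of the matches. The following is only a minimal sketch in plain Python; the helper names is_contained and drop_contained are hypothetical (they are not part of spaCy), and the match id 1 is a placeholder rather than a real spaCy hash.

```python
def is_contained(inner, outer):
    """Return True if the (start, end) token range `inner` lies inside `outer`."""
    inner_start, inner_end = inner
    outer_start, outer_end = outer
    return outer_start <= inner_start and inner_end <= outer_end


def drop_contained(matches):
    """Keep only matches whose range is not contained in a different range."""
    kept = []
    for match_id, start, end in matches:
        contained = any(
            (start, end) != (s, e) and is_contained((start, end), (s, e))
            for _, s, e in matches
        )
        if not contained:
            kept.append((match_id, start, end))
    return kept


# Token offsets as reported for 'MyLabel' (0, 1) and 'MyLabel:' (0, 2).
matches = [(1, 0, 1), (1, 0, 2)]
print(drop_contained(matches))
```

Unlike a general overlap filter, this only removes a match when it is fully covered by another one, so partially overlapping matches are both kept.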
pip show spacy
Name: spacy
Version: 3.7.5
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: [email protected]
License: MIT
Location: /home/adesoji/Downloads/visis-backend-assessment-Adesoji/visisenv/lib/python3.11/site-packages
Requires: catalogue, cymem, jinja2, langcodes, murmurhash, numpy, packaging, preshed, pydantic, requests, setuptools, spacy-legacy, spacy-loggers, srsly, thinc, tqdm, typer, wasabi, weasel
Required-by: en-core-web-sm
Here is an example in code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("MyLabel: Some Value")

matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': 'MyLabel'}, {'TEXT': ':', 'OP': '?'}]
matcher.add("R-1", [pattern])
matches = matcher(doc)

# Print every raw match, including the shorter one.
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Match: {span.text}, Start: {start}, End: {end}")

# Determine whether one match is a subset of another.
# Sort by start index, then by end index descending, so the longest match at each start comes first.
matches.sort(key=lambda x: (x[1], -x[2]))
filtered_matches = []
last_end = -1
for match_id, start, end in matches:
    if start >= last_end:  # Skip matches that overlap a match we already kept
        filtered_matches.append((match_id, start, end))
        last_end = end

for match_id, start, end in filtered_matches:
    span = doc[start:end]
    print(f"Filtered Match: {span.text}")
This code filters out the shorter match, and your output will be:
Match: MyLabel, Start: 0, End: 1
Match: MyLabel:, Start: 0, End: 2
Filtered Match: MyLabel:
As you can see, only 'MyLabel:' (with the colon) is left after filtering.
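spaCy also ships a helper for this kind of filtering, spacy.util.filter_spans, which removes overlapping spans and prefers the longest one. Here is a minimal sketch of how it could be used; it assumes a blank English pipeline (spacy.blank("en")) instead of en_core_web_sm, since only tokenization is needed, and that is my substitution rather than part of the setup above.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a blank pipeline is enough, because the Matcher only needs tokens.
nlp = spacy.blank("en")
doc = nlp("MyLabel: Some Value")

matcher = Matcher(nlp.vocab)
matcher.add("R-1", [[{"TEXT": "MyLabel"}, {"TEXT": ":", "OP": "?"}]])

# as_spans=True returns Span objects instead of (match_id, start, end) tuples.
spans = matcher(doc, as_spans=True)

# filter_spans drops overlapping spans, keeping the longest one.
longest = filter_spans(spans)
print([span.text for span in longest])
```

This does the same job as the manual sort-and-filter loop, so which one to use is mostly a matter of whether you need the raw match tuples or are happy working with Span objects.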
If you want to ensure that only the longest match is returned, you can pass the greedy option when you add the pattern to the matcher:
matcher.add("R-1", [pattern], greedy="LONGEST")
Note that greedy is a keyword argument of Matcher.add, not a key inside a token pattern or a rule dictionary. It doesn't change how the OP: '?' operator matches; it only filters the overlapping matches so that the longest one is kept.
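For reference, Matcher.add in spaCy 3.x accepts a greedy keyword argument ("FIRST" or "LONGEST"). The following is a minimal sketch of using it on the text from the question; the blank English pipeline is an assumption made to keep the example self-contained.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline stands in for en_core_web_sm; only tokenization is needed.
nlp = spacy.blank("en")
doc = nlp("MyLabel: Some Value")

pattern = [{"TEXT": "MyLabel"}, {"TEXT": ":", "OP": "?"}]

matcher = Matcher(nlp.vocab)
# greedy="LONGEST" keeps only the longest of the overlapping matches for this rule.
matcher.add("R-1", [pattern], greedy="LONGEST")

results = [doc[start:end].text for _, start, end in matcher(doc)]
print(results)
```

With this option the matcher itself returns a single match for the label, so no post-processing of the match list is needed.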