I am new to spaCy and I am trying to match measurements in text. My problem is that the unit of measure sometimes comes before the value and sometimes after it. In other cases the unit has a different name. Here is some code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"
pattern = [
    {"IS_STOP": True},
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"LOWER": "sq", "OP": "?"},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"IS_DIGIT": True},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"LOWER": "sq", "OP": "?"}
]
doc = nlp(text)
matcher = Matcher(nlp.vocab)
matcher.add("Surface", None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
I have two problems:
1 - The pattern should match all of cases 1 to 5, but for case 1 the output is
4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq
which looks like a duplicate match to me.
2 - Case 6 should not match, but my pattern matches it. Any suggestions on how to improve this?
EDIT: Is it possible to build an OR condition within the pattern? Something like:
pattern = [
    {"POS": "DET", "OP": "?"},
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    [
        [{"LOWER": "sq", "OP": "?"},
         {"LOWER": "square", "OP": "?"},
         {"LOWER": "meters", "OP": "?"},
         {"IS_ALPHA": True, "OP": "?"},
         {"LIKE_NUM": True}]
        OR
        [{"LIKE_NUM": True},
         {"LOWER": "square", "OP": "?"},
         {"LOWER": "meters", "OP": "?"},
         {"LOWER": "sq", "OP": "?"}]
    ]
]
You cannot use an OR like that, but you can define separate patterns for the same label. You need two patterns: one matches a number preceded by sq, square, meters, or a combination of these words, and the other matches a number followed by at least one of these words.
Code snippet:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
texts = [
    "the surface is 31 sq",
    "the surface is sq 31",
    "the surface is square meters 31",
    "the surface is 31 square meters",
    "the surface is about 31 square meters",
    "the surface is 31 kilograms",
]

# Unit words come before the number
pattern1 = [
    {"IS_STOP": True},
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"LIKE_NUM": True}
]

# Unit words come after the number
pattern2 = [
    {"IS_STOP": True},
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"}
]
matcher = Matcher(nlp.vocab, validate=True)
# spaCy v2 signature; in spaCy v3 use matcher.add("Surface", [pattern1, pattern2])
matcher.add("Surface", None, pattern1)
matcher.add("Surface", None, pattern2)

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)
Output:
4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
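On the first problem in the question (two overlapping matches for one sentence): optional tokens let more than one span satisfy the pattern, and the matcher reports each of them. Below is a minimal sketch of one way to keep only the longest match among overlapping ones. The helper name keep_longest is hypothetical, and the tuples are the two matches quoted in the question for case 1. It works on plain (match_id, start, end) tuples, so it does not depend on spaCy itself.

```python
def keep_longest(matches):
    """Keep the longest of any overlapping (match_id, start, end) tuples."""
    # Consider the longest spans first; ties go to the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for match_id, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The two matches reported for case 1 in the question.
matches = [
    (4898162435462687487, 0, 4),  # "the surface is 31"
    (4898162435462687487, 0, 5),  # "the surface is 31 sq"
]
result = keep_longest(matches)
print(result)  # [(4898162435462687487, 0, 5)]
```

If you already have Span objects, spacy.util.filter_spans should do the same kind of filtering, so a hand-written helper like this is mainly useful to see the idea.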
The {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens (due to "OP": "+") that match the regex:
^ - start of the token
(?i: - start of a case insensitive modifier group:
    sq(?:uare)? - sq or square
    | - or
    m(?:et(?:er|re)s?)? - m, meter/metre or meters/metres
) - end of the group
$ - end of the string (token here).
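To check the regex on its own, outside spaCy, here is a small sketch using Python's re module. The token list is made up for illustration, and it assumes the matcher tests each token's text the way re.search does (the ^ and $ anchors make the check a whole-token match either way).

```python
import re

# Same regex as in the patterns above.
UNIT = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")

# Illustrative tokens: the first six are unit words, the rest are not.
tokens = ["sq", "Square", "m", "meter", "metres", "METERS",
          "kilograms", "squared", "sqm"]

matched = [t for t in tokens if UNIT.search(t)]
print(matched)  # ['sq', 'Square', 'm', 'meter', 'metres', 'METERS']
```

Because "kilograms" does not match the regex, case 6 no longer matches either pattern.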