spacy rule matcher on unit of measure before or after digit

I am new to spaCy and am trying to match measurements in text. The problem is that the unit of measure sometimes comes before the value and sometimes after it, and in some cases the unit is written differently. Here is some code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"

pattern = [
    {"IS_STOP": True}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"}, 
    {"LOWER": "sq", "OP": "?"},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"IS_DIGIT": True}, 
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"LOWER": "sq", "OP": "?"} 
]

doc = nlp(text)

matcher = Matcher(nlp.vocab) 

# spaCy v2 signature; on spaCy v3 use matcher.add("Surface", [pattern]) instead
matcher.add("Surface", None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

I have two problems:

1 - The pattern should match all of cases 1 to 5, but for case 1 the output is

4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq

which looks like a duplicate match to me.

2 - Case 6 should not match, but my pattern matches it. Any suggestions on how to improve this?

EDIT: Is it possible to build an OR condition within the pattern? Something like

pattern = [
    {"POS": "DET", "OP": "?"}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    [
      [{"LOWER": "sq", "OP": "?"},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True}]
     OR
      [{"LIKE_NUM": True},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"LOWER": "sq", "OP": "?"} ]
    ]
]

Solution

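You cannot use an OR like that, but you can define several patterns for the same label. Here you need two: one that matches a number preceded by one or more of the unit words (sq, square, meters), and another that matches a number followed by one or more of those words. Your original pattern also matched case 6 because every unit token in it was optional, so the number alone was enough; in the patterns below the unit token is required ("OP": "+").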

Code snippet:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

texts = ["the surface is 31 sq", "the surface is sq 31", "the surface is square meters 31",
     "the surface is 31 square meters", "the surface is about 31 square meters", "the surface is 31 kilograms"]
pattern1 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
      {"LIKE_NUM": True}
    ]
pattern2 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True},
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"}
    ]

matcher = Matcher(nlp.vocab, validate=True)
# spaCy v2 signature; on spaCy v3 use matcher.add("Surface", [pattern1, pattern2]) instead
matcher.add("Surface", None, pattern1)
matcher.add("Surface", None, pattern2)

for text in texts:
  doc = nlp(text)
  matches = matcher(doc)
  for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

Output:

4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
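
Regarding the duplicate in case 1 of the question: the matcher can report several overlapping spans for the same text when a pattern contains optional or repeated tokens. Below is a minimal sketch, in plain Python, of one way to keep only the longest non-overlapping match. The helper name keep_longest is hypothetical (it is not part of spaCy), and it assumes the matches are the (match_id, start, end) tuples returned by matcher(doc).

```python
def keep_longest(matches):
    """Keep the longest non-overlapping (match_id, start, end) tuples.

    Hypothetical helper, not part of spaCy. Longer spans win; on a tie in
    length, the span that starts earlier wins.
    """
    # Longest first; for equal lengths, the smaller start index comes first.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    taken = set()  # token indices already covered by a kept span
    for match_id, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The two overlapping matches reported for case 1 in the question.
case1 = [(4898162435462687487, 0, 4), (4898162435462687487, 0, 5)]
print(keep_longest(case1))
```

In the loop above you would then iterate over keep_longest(matcher(doc)) instead of matcher(doc). If your spaCy version provides spacy.util.filter_spans, it applies a similar longest-span rule to Span objects.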

The {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens (due to "OP": "+") that match the regex:

^ - start of the string (here, the token)
(?i: - start of a case-insensitive modifier group:
    - sq(?:uare)? - sq or square
    - | - or
    - m(?:et(?:er|re)s?)? - m, meter/metre or meters/metres
) - end of the group
$ - end of the string (here, the token).
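
To check which token texts the regex accepts without loading a spaCy model, it can be tried directly with Python's re module. This is only a sketch: the helper name is_unit_token and the sample strings are assumptions made for illustration, and each string stands in for the text of a single token.

```python
import re

# Same regex as in the patterns above, compiled once.
UNIT_RE = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")


def is_unit_token(token_text):
    """Return True if a single token's text is one of the accepted unit words."""
    return UNIT_RE.search(token_text) is not None


# Hypothetical sample tokens: some unit words and some near misses.
for sample in ["sq", "SQ", "square", "m", "meter", "metres", "Meters",
               "kilograms", "squares", "sqm"]:
    print(sample, is_unit_token(sample))
```

The anchors matter here: without ^ and $, a token such as kilograms would be accepted because it contains an m.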