spaCy rule matcher: unit of measure before or after the digit


I am new to spaCy and I am trying to match measurements in some text. My problem is that the unit of measure sometimes comes before the value and sometimes after it. In other cases the unit has a different name. Here is some code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"

pattern = [
    {"IS_STOP": True}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"}, 
    {"LOWER": "sq", "OP": "?"},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"IS_DIGIT": True}, 
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"LOWER": "sq", "OP": "?"} 
]

doc = nlp(text)

matcher = Matcher(nlp.vocab) 

matcher.add("Surface", None, pattern)  # spaCy v2 signature; in v3: matcher.add("Surface", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

I have two problems. 1 - The pattern should match all of cases 1 to 5, but for case 1 the output is

4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq 

which looks like a duplicate match to me.

2 - Case 6 should not match, but with my pattern it does. Any suggestions on how to improve this?

EDIT: Is it possible to build an OR condition within the pattern? Something like

pattern = [
    {"POS": "DET", "OP": "?"}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    [
      [{"LOWER": "sq", "OP": "?"},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True}]
     OR
      [{"LIKE_NUM": True},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"LOWER": "sq", "OP": "?"} ]
    ]
]

Solution

  • You cannot use an OR like that, but you can define several patterns under the same label. Here you need two patterns: one matches a number preceded by sq, square, meters, or a combination of these words, and the other matches a number followed by at least one of these words.

    Code snippet:

    import spacy
    from spacy.matcher import Matcher
    nlp = spacy.load("en_core_web_sm")
    
    texts = [
        "the surface is 31 sq",
        "the surface is sq 31",
        "the surface is square meters 31",
        "the surface is 31 square meters",
        "the surface is about 31 square meters",
        "the surface is 31 kilograms",
    ]
    # Unit word(s) before the number
    pattern1 = [
        {"IS_STOP": True},
        {"LOWER": "surface"},
        {"LEMMA": "be", "OP": "?"},
        {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
        {"LIKE_NUM": True},
    ]
    # Unit word(s) after the number, with one optional word (e.g. "about") before it
    pattern2 = [
        {"IS_STOP": True},
        {"LOWER": "surface"},
        {"LEMMA": "be", "OP": "?"},
        {"IS_ALPHA": True, "OP": "?"},
        {"LIKE_NUM": True},
        {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    ]
    
    matcher = Matcher(nlp.vocab, validate=True)
    # spaCy v2 signature; in v3 use matcher.add("Surface", [pattern1, pattern2])
    matcher.add("Surface", None, pattern1)
    matcher.add("Surface", None, pattern2)
    
    for text in texts:
        doc = nlp(text)
        matches = matcher(doc)
        for match_id, start, end in matches:
            string_id = nlp.vocab.strings[match_id]  # Get string representation
            span = doc[start:end]  # The matched span
            print(match_id, string_id, start, end, span.text)
    

    Output:

    4898162435462687487 Surface 0 5 the surface is 31 sq
    4898162435462687487 Surface 0 5 the surface is sq 31
    4898162435462687487 Surface 0 6 the surface is square meters 31
    4898162435462687487 Surface 0 5 the surface is 31 square
    4898162435462687487 Surface 0 6 the surface is about 31 square
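
Regarding the duplicate match from the question: the matcher reports every span that satisfies a pattern, so optional and "+" tokens can yield both a shorter and a longer match from the same start. One way to handle this is to keep only the longest match among overlapping ones. The sketch below does that on the raw (match_id, start, end) tuples using only the standard library; the helper name keep_longest is made up for illustration and is not part of spaCy. spaCy itself offers spacy.util.filter_spans for Span objects, and in v3 Matcher.add accepts greedy="LONGEST", both of which should give a similar result.

```python
def keep_longest(matches):
    """Keep the longest matches and drop shorter ones that overlap them.

    `matches` is a list of (match_id, start, end) tuples, the shape
    returned by matcher(doc). This helper is hypothetical, not a spaCy API.
    """
    # Longest span first; ties are broken by the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, seen = [], set()
    for match_id, start, end in ordered:
        # Skip a match if any of its token positions is already covered.
        if not any(i in seen for i in range(start, end)):
            kept.append((match_id, start, end))
            seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The two matches reported in the question for "the surface is 31 sq":
matches = [(4898162435462687487, 0, 4), (4898162435462687487, 0, 5)]
print(keep_longest(matches))
```

Applied before the print loop, this leaves one line per measurement instead of one line per partial match.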
    

    The {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens (due to "OP": "+") that match the regex:

    • ^ - start of the token
    • (?i: - start of a case insensitive modifier group:
      • sq(?:uare)? - sq or square
      • | - or
      • m(?:et(?:er|re)s?)? - m, meter/metre or meters/metres
    • ) - end of the group
    • $ - end of the string (token here).
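
The breakdown above can be checked outside spaCy with Python's re module, testing the same expression against single token strings. This is only a sketch: the names UNIT_RE and is_unit are made up for illustration, and it assumes the expression is applied to one token's text at a time, as described above.

```python
import re

# Same expression as in the patterns; (?i:...) makes the group case-insensitive.
UNIT_RE = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")


def is_unit(token_text: str) -> bool:
    """Return True if a single token's text is one of the accepted unit words."""
    return UNIT_RE.search(token_text) is not None


for tok in ["sq", "Square", "m", "meters", "metres", "kilograms", "squares"]:
    print(tok, is_unit(tok))
```

Only the last two tokens are rejected, which is why "the surface is 31 kilograms" no longer matches: the anchors ^ and $ force the whole token to be one of the listed unit words.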