Tags: python, nlp, spacy

spacy rule-matcher extract value from matched sentence


I have a custom rule-based Matcher in spaCy, and I am able to match some sentences in a document. I would now like to extract some numbers from the matched sentences. However, the matched sentences do not always have the same shape and form. What is the best way to do this?

# case 1:
texts = ["the surface is 31 sq",
         "the surface is sq 31",
         "the surface is square meters 31",
         "the surface is 31 square meters",
         "the surface is about 31,2 square",
         "the surface is 31 kilograms"]

pattern = [
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
]

pattern_1 = [
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
    {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$", "OP": "+"}}
]
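
As a quick sanity check (separate from the Matcher itself), the token-level REGEX shared by both patterns can be exercised on its own with Python's `re` module to see which unit spellings it accepts:

```python
import re

# The same token-level regex used in the Matcher patterns above
unit_re = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")

for token in ["sq", "square", "m", "meter", "metres", "kilograms"]:
    print(token, bool(unit_re.match(token)))
```

Note that the `(?i:...)` inline-flag group requires Python 3.6 or later.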

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # any English pipeline works here

matcher = Matcher(nlp.vocab)

# spaCy v2 signature; in v3 this would be matcher.add("Surface", [pattern, pattern_1])
matcher.add("Surface", None, pattern, pattern_1)

for index, text in enumerate(texts):
    print(f"Case {index}")
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)

My output is:

Case 0
4898162435462687487 Surface 1 5 surface is 31 sq
Case 1
4898162435462687487 Surface 1 5 surface is sq 31
Case 2
4898162435462687487 Surface 1 6 surface is square meters 31
Case 3
4898162435462687487 Surface 1 5 surface is 31 square
Case 4
4898162435462687487 Surface 1 6 surface is about 31,2 square
Case 5

I would like to return only the number of square meters, e.g. [31, 31, 31, 31, 31.2], rather than the full matched text. What is the correct way to do this in spaCy?


Solution

  • Since each match contains a single token for which LIKE_NUM is true, you can simply walk the matched span's subtree and return the first such token:

    value = [token for token in span.subtree if token.like_num][0]
    

    Test:

    results = []
    for text in texts:
        doc = nlp(text)
        matches = matcher(doc)
        for match_id, start, end in matches:
            span = doc[start:end]  # The matched span
            results.append([token for token in span.subtree if token.like_num][0])
    
    print(results) # => [31, 31, 31, 31, 31,2]
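
Note that `results` holds spaCy `Token` objects, which is why the list prints as `[31, 31, 31, 31, 31,2]` rather than `[31, 31, 31, 31, 31.2]`. If actual Python numbers are needed, a small helper can normalize each token's text; the helper name `to_number` and the comma-as-decimal-separator assumption are mine, not part of the original answer:

```python
def to_number(text):
    """Convert a number-like token text to a float, treating ',' as a decimal separator."""
    return float(text.replace(",", "."))

# Apply to the token texts collected above
values = [to_number(t) for t in ["31", "31", "31", "31", "31,2"]]
print(values)  # [31.0, 31.0, 31.0, 31.0, 31.2]
```

In the loop above this would be `results.append(to_number(token.text))` instead of appending the token itself.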