Tags: python, nlp, spacy

Patterns with ENT_TYPE from manually labelled Span not working


As an alternative way to accomplish what I asked about in "Patterns with multi-terms entries in the IN attribute", I wrote the following code. It matches phrases, labels them as entities, and then uses those labels in EntityRuler patterns:

# %%
import spacy
from spacy.matcher import PhraseMatcher
from spacy.pipeline import EntityRuler
from spacy.tokens import Span

class PhraseRuler(object):
    name = 'phrase_ruler'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc

nlp = spacy.load("en_core_web_lg")

entity_matcher = PhraseRuler(nlp, ["Best Wishes", "Warm Welcome"], "GREETING")
nlp.add_pipe(entity_matcher, before="ner")


ruler = EntityRuler(nlp)
patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"ENT_TYPE": "GREETING"}]}]
ruler.add_patterns(patterns)
#ruler.to_disk("./data/patterns.jsonl")
nlp.add_pipe(ruler)

print(nlp.pipe_names) 

doc = nlp("Mary said Best Wishes and I said super Warm Welcome.")
print(doc.to_json())

Unfortunately, this does not work: the output contains no SUPER_GREETING entity:

'ents': [
   {'start': 0, 'end': 4, 'label': 'PERSON'}, 
   {'start': 10, 'end': 21, 'label': 'GREETING'}, 
   {'start': 39, 'end': 51, 'label': 'GREETING'}
]

What am I doing wrong? How do I fix it?


Solution

  • You have the right idea. The problem is an intrinsic design choice in spaCy: the spans in doc.ents cannot overlap, so any token can be part of at most one named entity. "Warm Welcome" therefore can't be both a "GREETING" and part of a "SUPER_GREETING".

    One way to work around this is to use custom extension attributes. For instance, you could store the GREETING information at the token level:

    from spacy.tokens import Token
    Token.set_extension("mylabel", default="")
    

    Then adjust PhraseRuler.__call__ so that, instead of writing to doc.ents, it sets the attribute on every token of each matched span:

    for token in span:
        token._.mylabel = "MY_GREETING"
    

    Now, we can rewrite the SUPER_GREETING pattern to:

    patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"_": {"mylabel": "MY_GREETING"}, "OP": "+"}]}]
    

    which will match "super" followed by one or more "MY_GREETING" tokens. It matches greedily and outputs "super Warm Welcome" as a hit.

    Here's the resulting code snippet, starting from your code and making the adjustments described above:

    # Note: like the question, this uses the spaCy v2 API
    # (component objects are passed directly to nlp.add_pipe).
    import spacy
    from spacy.matcher import PhraseMatcher
    from spacy.pipeline import EntityRuler
    from spacy.tokens import Span, Token

    # Token-level attribute marking tokens that belong to a matched phrase
    Token.set_extension("mylabel", default="")

    class PhraseRuler(object):
        name = 'phrase_ruler'

        def __init__(self, nlp, terms, label):
            patterns = [nlp(term) for term in terms]
            self.matcher = PhraseMatcher(nlp.vocab)
            self.matcher.add(label, None, *patterns)

        def __call__(self, doc):
            matches = self.matcher(doc)
            for label, start, end in matches:
                span = Span(doc, start, end, label=label)
                # Tag the tokens instead of writing the span to doc.ents
                for token in span:
                    token._.mylabel = "MY_GREETING"
            return doc

    nlp = spacy.load("en_core_web_lg")

    entity_matcher = PhraseRuler(nlp, ["Best Wishes", "Warm Welcome"], "GREETING")
    nlp.add_pipe(entity_matcher, name="entity_matcher", before="ner")

    # "super" followed by one or more tokens tagged as MY_GREETING
    ruler = EntityRuler(nlp)
    patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"_": {"mylabel": "MY_GREETING"}, "OP": "+"}]}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler, after="entity_matcher")

    print(nlp.pipe_names)

    doc = nlp("Mary said Best Wishes and I said super Warm Welcome.")
    print("TOKENS:")
    for token in doc:
        print(token.text, token._.mylabel)
    print()

    print("ENTITIES:")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    

    This outputs:

    TOKENS:
    Mary 
    said 
    Best MY_GREETING
    Wishes MY_GREETING
    and 
    I 
    said 
    super 
    Warm MY_GREETING
    Welcome MY_GREETING
    . 
    
    ENTITIES:
    Mary PERSON
    super Warm Welcome SUPER_GREETING
    

    This may not be exactly what you need, but I hope it helps you move forward with an alternative solution for your use case. If you do want the plain "GREETING" spans in the final doc.ents, you could reassemble them in post-processing after the EntityRuler has run, e.g. by turning the custom attributes back into entities where they don't overlap existing ones, or by keeping a cache of the matched spans somewhere.
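
The post-processing idea mentioned above (re-adding GREETING spans that do not overlap any existing entity) could look roughly like the sketch below. It is plain Python and deliberately avoids spaCy: spans are represented as hypothetical (start, end, label) tuples of token offsets with an exclusive end, and the helper name merge_non_overlapping is made up for this illustration. In a real pipeline you would presumably build Span objects from the surviving tuples and assign them to doc.ents.

```python
def merge_non_overlapping(existing, candidates):
    """Return existing spans plus every candidate that overlaps none of them.

    Spans are (start, end, label) tuples of token offsets, end exclusive.
    Existing spans always win; a candidate is dropped on any overlap.
    """
    merged = list(existing)
    for cand in candidates:
        c_start, c_end, _ = cand
        # Two half-open ranges overlap if each starts before the other ends
        overlaps = any(c_start < end and start < c_end for start, end, _ in merged)
        if not overlaps:
            merged.append(cand)
    return sorted(merged)


# Token offsets for "Mary said Best Wishes and I said super Warm Welcome."
# Entities after the EntityRuler has run (as in the output above):
ents = [(0, 1, "PERSON"), (7, 10, "SUPER_GREETING")]
# Assumed cache of the spans found by the PhraseMatcher:
cached_greetings = [(2, 4, "GREETING"), (8, 10, "GREETING")]

merged = merge_non_overlapping(ents, cached_greetings)
print(merged)
```

Here "Best Wishes" is added back as a GREETING, while "Warm Welcome" is dropped because it already sits inside the SUPER_GREETING span. Giving priority to the existing entities is a design choice for this sketch; you could just as well prefer the longest span.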