How to merge entities in spaCy via rules


I want to use some of the entities in spaCy 3's en_core_web_lg, but relabel some of what it tags as 'ORG' as 'ANALYTIC', because it treats the three-character codes I care about, such as 'P&L' and 'VaR', as organizations. The model also has DATE entities, which I'm happy to keep. From the docs it seems like I should be able to use the EntityRuler with the syntax below, but I'm not getting anywhere. I have been through the training 2-3 times now and read all the Usage and API docs, and I just don't see any examples of working code. I get all sorts of different error messages, such as one saying I need a decorator. Lord, is it really that hard?

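My code:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_lg")

analytics = [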
    [{'LOWER':'risk'}],
    [{'LOWER':'pnl'}],
    [{'LOWER':'p&l'}],
    [{'LOWER':'return'}],
    [{'LOWER':'returns'}]
]


matcher = Matcher(nlp.vocab)
matcher.add("ANALYTICS", analytics)

doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "ANALYTIC"
    span = Span(doc, start, end, label="ANALYTIC")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

This of course crashes when my new 'ANALYTIC' entity span collides with the existing 'ORG' one. But I have no idea how to either merge these offline and put them back, or create my own custom pipeline using rules. This is the suggested usage from the EntityRuler docs, and I have no clue how to apply it.

# Construction via add_pipe
ruler = nlp.add_pipe("entity_ruler")

# Construction from class
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)

Solution

  • When you say it "crashes", what's happening is that you have conflicting spans. For doc.ents specifically, each token can be in at most one span. In your case you can fix this by modifying this line:

    doc.ents = list(doc.ents) + [span]
    

    Here you've included both the old span (which you don't want) and the new span. If you assign doc.ents without the old overlapping span, this will work.
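
As a rough illustration of that fix, here is a minimal sketch that drops any existing entity overlapping a new match before assigning doc.ents. To keep it runnable without downloading a model, it assumes a hand-built Doc whose ORG and DATE entities stand in for what en_core_web_lg might produce; the `overlaps` helper is a hypothetical name introduced only for this example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # blank pipeline, used here only for its vocab

# Assumption: a hand-built Doc standing in for the model's output.
words = ["P&L", "reported", "amazing", "returns", "this", "year", "."]
doc = Doc(nlp.vocab, words=words)
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 4, 6, label="DATE")]

matcher = Matcher(nlp.vocab)
matcher.add("ANALYTIC", [[{"LOWER": "p&l"}], [{"LOWER": "returns"}]])

# One ANALYTIC span per match; filter_spans removes overlaps among the new spans.
new_spans = filter_spans(
    [Span(doc, start, end, label="ANALYTIC") for _, start, end in matcher(doc)]
)


def overlaps(a, b):
    """Return True if the two spans share at least one token."""
    return a.start < b.end and b.start < a.end


# Keep only the existing entities that do not collide with a new span.
kept = [ent for ent in doc.ents if not any(overlaps(ent, s) for s in new_spans)]

# Assign everything in one go, outside the match loop.
doc.ents = sorted(kept + new_spans, key=lambda s: s.start)

for ent in doc.ents:
    print(ent.label_, ent.text, sep="\t")
```

With a real model you would replace the hand-built Doc with `doc = nlp(text)`; the filtering and the single assignment to doc.ents stay the same.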

    There are also other ways to do this. Here I'll use a simplified example where you always want to relabel entities whose text is three characters long, but you can adapt it to use your list of specific words or some other condition.

    You can directly modify entity labels, like this:

    for ent in doc.ents:
        if len(ent.text) == 3:
            ent.label_ = "CHECK"
        print(ent.label_, ent, sep="\t")
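
Depending on the spaCy version, a label set on a span taken from doc.ents may not carry over to the Doc itself. A minimal sketch that sidesteps the question is to build new Span objects and assign them back to doc.ents. The Doc and its ORG entities below are hand-built assumptions standing in for real model output, and the ANALYTIC label follows the question rather than the CHECK placeholder above.

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")  # blank pipeline, used here only for its vocab

# Assumption: a hand-built Doc with ORG entities, as a model might label them.
words = ["VaR", "and", "P&L", "were", "reviewed", "by", "Goldman", "Sachs", "."]
doc = Doc(nlp.vocab, words=words)
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 2, 3, label="ORG"),
    Span(doc, 6, 8, label="ORG"),
]

relabeled = []
for ent in doc.ents:
    if len(ent.text) == 3:
        # Same token boundaries, new label.
        relabeled.append(Span(doc, ent.start, ent.end, label="ANALYTIC"))
    else:
        relabeled.append(ent)

# The spans cover the same tokens as before, so there is no overlap to resolve.
doc.ents = relabeled

for ent in doc.ents:
    print(ent.label_, ent.text, sep="\t")
```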
    

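    If you want to use the EntityRuler, note that add_pipe puts it at the end of the pipeline by default, so it runs after the ner component and its patterns can match on the entity type that ner assigned. LENGTH is the character length of a single token. It would look like this: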

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    
    ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents":True})
    
    patterns = [
        {"label": "ANALYTIC", "pattern": [{"ENT_TYPE": "ORG", "LENGTH": 3}]},
    ]
    
    ruler.add_patterns(patterns)
    
    text = "P&L reported amazing returns this year."
    
    doc = nlp(text)
    
    for ent in doc.ents:
        print(ent.label_, ent, sep="\t")
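
If you would rather match your specific word list than token length, the same ruler can take LOWER patterns. The following is a sketch under stated assumptions: it uses a blank pipeline and a hand-built Doc with ORG and DATE entities standing in for the model's output, and it calls the ruler directly on that Doc. In a real pipeline you would add the ruler to the loaded model and simply call `nlp(text)`.

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})

# Word list taken from the question; each entry becomes a one-token pattern.
terms = ["risk", "pnl", "p&l", "return", "returns"]
ruler.add_patterns(
    [{"label": "ANALYTIC", "pattern": [{"LOWER": term}]} for term in terms]
)

# Assumption: a hand-built Doc standing in for what the ner component produced.
words = ["P&L", "reported", "amazing", "returns", "this", "year", "."]
doc = Doc(nlp.vocab, words=words)
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 4, 6, label="DATE")]

# With overwrite_ents=True the ruler replaces overlapping entities
# and leaves the others (here DATE) untouched.
doc = ruler(doc)

for ent in doc.ents:
    print(ent.label_, ent.text, sep="\t")
```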
    

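    One more thing: I'm using spaCy v3 here. The way pipes are added changed in v3: nlp.add_pipe now takes the string name of a registered component (such as "entity_ruler") instead of a component object, which is probably where the errors about needing a decorator came from.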