Break after first PER sequence found with Spacy


I am trying to extract only the first speaker's name from a list of texts using spaCy. Currently, my function returns all "PER" tags, but I want to reduce the overhead and get only the first contiguous sequence of "PER" entities. Here’s the example output I get:

Detected Names in Text: ['garcía', 'lópez']
Detected Names in Text: ['j. jesus orozco alfaro']
Detected Names in Text: ['josé guadarrama márquez', 'josé guadarrama']
Detected Names in Text: ['pedro sánchez', 'josé manuel albares', 'pablo iglesias']

But I want the result to be:

Detected Names in Text: ['garcía']
Detected Names in Text: ['j. jesus orozco alfaro']
Detected Names in Text: ['josé guadarrama márquez']
Detected Names in Text: ['pedro sánchez']

Here is the code I am currently using:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("es_core_news_lg")

texts = [
    "El Sr. García habló en la sesión. También estuvo presente el Senador López y la Diputada Martínez.",
    "PRESIDENCIA DEL C. SENADOR J. JESUS OROZCO ALFARO",
    "            -ER C. José Guadarrama Márquez: el contrabando del dia, José Guadarrama Márquez",
    "El presidente Pedro Sánchez y el Ministro de Asuntos Exteriores José Manuel Albares se reunieron con el Senador Pablo Iglesias."
]
texts = [text.lower() for text in texts]

matcher = Matcher(nlp.vocab)

patterns = [
    [{"LOWER": "el"}, {"LOWER": "c"}],
    [{"LOWER": "el"}, {"LOWER": "sr"}],
    [{"LOWER": "el"}, {"LOWER": "sra"}]
]

matcher.add("LEGISLATIVE_TITLES", patterns)

# Function to find a sequence of PER entities allowing one MISC
def find_per_sequence(doc, start_idx=0):
    per_entities = []
    misc_count = 0
    
    for ent in doc[start_idx:].ents:
        if ent.label_ == "PER":
            per_entities.append(ent.text)
        elif ent.label_ == "MISC" and misc_count < 1:
            misc_count += 1
            per_entities.append(ent.text)
        else:
            break  # Should stop if any other entity or second MISC is encountered
    
    return per_entities

for text in texts:
    doc = nlp(text)
    
    # Find matches
    matches = matcher(doc)
    
    # Extract the first match and its position
    title_start = None
    title_end = None
    for match_id, start, end in matches:
        title_start = start
        title_end = end
        break

    # If a title was found, start searching for PER entities from that position
    if title_start is not None:
        names = find_per_sequence(doc, start_idx=title_end)
    else:
        names = find_per_sequence(doc)

    # Output the detected names for each text
    print(f"Detected Names in Text: {names}")

What I'm looking for:

I want to modify the find_per_sequence function so that it returns only the first contiguous sequence of "PER" entities in the text, ignoring any subsequent "PER" entities after encountering a different type of entity. The provided function returns multiple names or partial names, and I need a way to ensure only the first name or sequence is included. How can I achieve this?


Solution

  • The issue is that doc[start_idx:].ents contains only the named entities in that slice of the doc. Your loop therefore never sees "habló" in the first text; it goes straight from "García" to "López". To notice where the PER sequence ends, iterate over the tokens themselves by leaving out the .ents part. Wait until you see the first token whose ent_type_ is PER, start appending, and break as soon as one of your stop conditions is met. Note that this collects one string per token rather than per entity, so join the result with " ".join(names) if you need each name as a single string. I refactored your code a little while debugging, but here's an edited version of your program that returns only the first name sequence:

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("es_core_news_lg")
    
    texts = [
        "El Sr. García habló en la sesión. También estuvo presente el Senador López y la Diputada Martínez.",
        "PRESIDENCIA DEL C. SENADOR J. JESUS OROZCO ALFARO",
        "            -ER C. José Guadarrama Márquez: el contrabando del dia, José Guadarrama Márquez",
        "El presidente Pedro Sánchez y el Ministro de Asuntos Exteriores José Manuel Albares se reunieron con el Senador Pablo Iglesias.",
    ]
    texts = [text.lower() for text in texts]
    
    matcher = Matcher(nlp.vocab)
    
    patterns = [
        [{"LOWER": "el"}, {"LOWER": "c"}],
        [{"LOWER": "el"}, {"LOWER": "sr"}],
        [{"LOWER": "el"}, {"LOWER": "sra"}],
    ]
    
    matcher.add("LEGISLATIVE_TITLES", patterns)
    
    
    # Find the first contiguous run of PER tokens, allowing one MISC token once the run has started
    def find_per_sequence(doc: spacy.tokens.Doc, start_idx: int):
        per_entities = []
        misc_count = 0
        per_started = False
    
        for token in doc[start_idx:]:
            if token.ent_type_ == "PER":
                per_entities.append(token.text)
                per_started = True
            elif token.ent_type_ == "MISC" and misc_count < 1 and per_started:
                misc_count += 1
                per_entities.append(token.text)
            elif per_started:
                break  # Stop at the first token that does not extend the run (no entity, another label, or a second MISC)
    
        return per_entities
    
    
    for text in texts:
        doc = nlp(text)
    
        # Find matches
        matches = matcher(doc)
    
        # Extract the first match and its position
        _, _, title_end = matches[0] if matches else (None, None, None)
    
        names = find_per_sequence(doc, title_end if title_end else 0)
    
        # Output the detected names for each text
        print(f"Detected Names in Text: {names}")
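
The difference between looping over entities and looping over tokens can be seen without loading a model. The sketch below is a simplification and not spaCy code: plain (text, label) tuples stand in for tokens, an empty label marks a token outside any entity (mirroring what token.ent_type_ returns for such tokens), and the labels are assigned by hand for illustration rather than produced by es_core_news_lg. The helper name first_per_run is hypothetical.

```python
# Hypothetical stand-in for a spaCy Doc: (token text, entity label) pairs.
# Labels are assigned by hand for illustration; "" means "not in an entity".
tokens = [
    ("el", ""), ("sr", ""), (".", ""), ("garcía", "PER"),
    ("habló", ""), ("en", ""), ("la", ""), ("sesión", ""), (".", ""),
    ("también", ""), ("estuvo", ""), ("presente", ""), ("el", ""),
    ("senador", ""), ("lópez", "PER"),
]


def first_per_run(tokens, start_idx=0, max_misc=1):
    """Return the texts of the first contiguous PER run at or after start_idx.

    Up to max_misc MISC tokens are tolerated once the run has started.
    """
    run = []
    misc_seen = 0
    for text, label in tokens[start_idx:]:
        if label == "PER":
            run.append(text)
        elif label == "MISC" and run and misc_seen < max_misc:
            misc_seen += 1
            run.append(text)
        elif run:
            break  # first token that does not extend the run ends the scan
    return run


# Entity-only view, analogous to iterating over .ents: the gap is invisible.
entity_only = [text for text, label in tokens if label]
print(entity_only)            # ['garcía', 'lópez']

# Token-level scan: "habló" carries no label, so the run stops there.
print(first_per_run(tokens))  # ['garcía']
```

Because the result holds one string per token, " ".join(first_per_run(tokens)) gives the name as a single string when the run spans several tokens.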