I am trying to extract only the first speaker's name from a list of texts using spaCy. Currently, my function returns all "PER" tags, but I want to reduce the overhead and get only the first contiguous sequence of "PER" entities. Here’s the example output I get:
Detected Names in Text: ['garcía', 'lópez']
Detected Names in Text: ['j. jesus orozco alfaro']
Detected Names in Text: ['josé guadarrama márquez', 'josé guadarrama']
Detected Names in Text: ['pedro sánchez', 'josé manuel albares', 'pablo iglesias']
But I want the result to be:
Detected Names in Text: ['garcía']
Detected Names in Text: ['j. jesus orozco alfaro']
Detected Names in Text: ['josé guadarrama márquez']
Detected Names in Text: ['pedro sánchez']
Here is the code I am currently using:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("es_core_news_lg")

texts = [
    "El Sr. García habló en la sesión. También estuvo presente el Senador López y la Diputada Martínez.",
    "PRESIDENCIA DEL C. SENADOR J. JESUS OROZCO ALFARO",
    " -ER C. José Guadarrama Márquez: el contrabando del dia, José Guadarrama Márquez",
    "El presidente Pedro Sánchez y el Ministro de Asuntos Exteriores José Manuel Albares se reunieron con el Senador Pablo Iglesias."
]
texts = [text.lower() for text in texts]

matcher = Matcher(nlp.vocab)
patterns = [
    [{"LOWER": "el"}, {"LOWER": "c"}],
    [{"LOWER": "el"}, {"LOWER": "sr"}],
    [{"LOWER": "el"}, {"LOWER": "sra"}]
]
matcher.add("LEGISLATIVE_TITLES", patterns)

# Function to find a sequence of PER entities allowing one MISC
def find_per_sequence(doc, start_idx=0):
    per_entities = []
    misc_count = 0
    for ent in doc[start_idx:].ents:
        if ent.label_ == "PER":
            per_entities.append(ent.text)
        elif ent.label_ == "MISC" and misc_count < 1:
            misc_count += 1
            per_entities.append(ent.text)
        else:
            break  # Should stop if any other entity or second MISC is encountered
    return per_entities

for text in texts:
    doc = nlp(text)
    # Find matches
    matches = matcher(doc)
    # Extract the first match and its position
    title_start = None
    title_end = None
    for match_id, start, end in matches:
        title_start = start
        title_end = end
        break
    # If a title was found, start searching for PER entities from that position
    if title_start is not None:
        names = find_per_sequence(doc, start_idx=title_end)
    else:
        names = find_per_sequence(doc)
    # Output the detected names for each text
    print(f"Detected Names in Text: {names}")
What I'm looking for:
I want to modify the find_per_sequence function so that it returns only the first contiguous sequence of "PER" entities in the text, ignoring any subsequent "PER" entities after encountering a different type of entity. The provided function returns multiple names or partial names, and I need a way to ensure only the first name or sequence is included. How can I achieve this?
The issue is that doc[start_idx:].ents contains only the named entities in that slice of the doc. Thus, you never process "habló" for the first entry; you go straight from "García" to "López", so the loop cannot tell that the two names are not adjacent. To actually iterate over the tokens so that you see where the PER sequence ends, you have to leave out the .ents part. Then you just wait until you see the first token whose ent_type_ is PER and start appending, then break once one of your conditions is met. I ended up refactoring your code a little as I debugged this, but here's an edited version of your program that produces the desired outputs:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("es_core_news_lg")

texts = [
    "El Sr. García habló en la sesión. También estuvo presente el Senador López y la Diputada Martínez.",
    "PRESIDENCIA DEL C. SENADOR J. JESUS OROZCO ALFARO",
    " -ER C. José Guadarrama Márquez: el contrabando del dia, José Guadarrama Márquez",
    "El presidente Pedro Sánchez y el Ministro de Asuntos Exteriores José Manuel Albares se reunieron con el Senador Pablo Iglesias.",
]
texts = [text.lower() for text in texts]

matcher = Matcher(nlp.vocab)
patterns = [
    [{"LOWER": "el"}, {"LOWER": "c"}],
    [{"LOWER": "el"}, {"LOWER": "sr"}],
    [{"LOWER": "el"}, {"LOWER": "sra"}],
]
matcher.add("LEGISLATIVE_TITLES", patterns)

# Function to find the first run of PER tokens, allowing one MISC token inside it
def find_per_sequence(doc: spacy.tokens.Doc, start_idx: int):
    per_entities = []
    misc_count = 0
    per_started = False
    for token in doc[start_idx:]:
        if token.ent_type_ == "PER":
            per_entities.append(token.text)
            per_started = True
        elif token.ent_type_ == "MISC" and misc_count < 1 and per_started:
            # Tolerate a single MISC token once the PER run has begun
            misc_count += 1
            per_entities.append(token.text)
        elif per_started:
            # Any other token (or a second MISC) after the run started ends it
            break
    return per_entities

for text in texts:
    doc = nlp(text)
    # Find matches
    matches = matcher(doc)
    # Take the end position of the first title match, if there is one
    _, _, title_end = matches[0] if matches else (None, None, None)
    # Start scanning right after the title, or from the beginning if no title matched
    names = find_per_sequence(doc, title_end if title_end else 0)
    # Output the detected names for each text
    print(f"Detected Names in Text: {names}")
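One thing to watch for: find_per_sequence appends token.text, so a multi-word name comes back as several list items rather than one string. Below is a minimal sketch of the same token-level scan with the pieces joined afterwards. It works on plain (text, label) pairs so it runs without the model; first_per_run and max_misc are names I made up for illustration, and the labels in the sample data are hand-assigned, not real es_core_news_lg output. With spaCy you would build the pairs as [(t.text, t.ent_type_) for t in doc[start_idx:]].

```python
from typing import Iterable, List, Tuple


def first_per_run(tagged_tokens: Iterable[Tuple[str, str]], max_misc: int = 1) -> List[str]:
    """Return the token texts of the first contiguous PER run.

    Up to `max_misc` MISC tokens are tolerated once the run has started.
    (Hypothetical helper, not part of spaCy.)
    """
    run: List[str] = []
    misc_seen = 0
    for text, label in tagged_tokens:
        if label == "PER":
            run.append(text)
        elif label == "MISC" and run and misc_seen < max_misc:
            misc_seen += 1
            run.append(text)
        elif run:
            # First token after the run that is neither PER nor an allowed MISC
            break
    return run


# Hand-labelled stand-ins for [(t.text, t.ent_type_) for t in doc]
sample_one = [
    ("el", ""), ("sr", ""), (".", ""), ("garcía", "PER"), ("habló", ""),
    ("en", ""), ("la", ""), ("sesión", ""), (".", ""),
    ("senador", ""), ("lópez", "PER"),
]
sample_two = [
    ("senador", ""), ("j.", "PER"), ("jesus", "PER"),
    ("orozco", "PER"), ("alfaro", "PER"),
]

for sample in (sample_one, sample_two):
    # Join the run into one name string
    print(" ".join(first_per_run(sample)))
```

Joining with a single space is a simplification: it will not reproduce the original spacing if the tokenizer splits punctuation off a word. If that matters, record the first and last token index of the run instead and take doc[first:last + 1].text, which keeps the original whitespace.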