python, regex, nlp, spacy

How to locate a spaCy span from a string range?


I am studying natural language processing with spaCy and decided to write a pipeline component that identifies forms of address such as Dr, Mr, and Mrs followed by a name and marks them as PERSON entities. I used regular expressions to identify the prefix part, but I found that whenever I search for the pattern, the match gives me character indexes into the whole source text. My task is to extract the addressed names and add them as Span objects; however, the Span constructor requires token positions within the Doc, not character positions.

Is it possible to locate the span in a Doc that corresponds to a string position? I could enumerate every span individually, store its index in a separate variable, and search the regular expression within it, and I would be satisfied with that solution. But is it possible either to translate a string position into the span it belongs to, or to create spans for newly recognised entities in a different way?

import re
import spacy as sp
from spacy.tokens import Doc, Span
from spacy.language import Language
english = sp.load('en_core_web_sm') #Load the English language model
with open('sample.txt', 'r') as f:
    source = f.read() #Text used from Wikipedia https://en.wikipedia.org/wiki/Chocolate

@Language.component("addresses")
def searchNames(doc: Doc) -> Doc:
    """Searches for names in the text marked with addressing and adds them to the named entities."""
    #The pattern expression searches for names with addresses Mr., Mrs., Ms., Dr., Prof., etc.
    pattern = re.compile(r"(P|M|D)(r|s)(s|\.)?") #(f|.)?\s[A-Z][a-z]+
    matches = [(match.start(), match.end()) for match in pattern.finditer(doc.text)]
    #Resolve the index error.
    #Add the names to the named entities.
    names = [Span(doc, start, end, "PERSON") for start, end in matches]
    doc.ents = tuple(list(doc.ents) + names)
    return doc

english.add_pipe("addresses")
chocolate = english(source)
for entity in chocolate.ents:
    if entity.label_ == "PERSON":
        print(entity.text, entity.label_)


Solution

  • Just use Doc.char_span.

    doc = nlp("I like New York")
    span = doc.char_span(7, 15, label="GPE")
    assert span.text == "New York"
    

    Also be sure to check the return value: if your indices don't line up with token boundaries, char_span will return None.
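
    To connect this with the component from the question, here is a minimal sketch. The regex is a simplified assumption (an honorific, an optional dot, then a capitalised word), and spacy.util.filter_spans is used so that spans overlapping entities the statistical NER already produced don't cause an error when assigning doc.ents.

    import re
    import spacy as sp
    from spacy.tokens import Doc
    from spacy.language import Language
    from spacy.util import filter_spans

    @Language.component("addresses")
    def searchNames(doc: Doc) -> Doc:
        """Searches for names preceded by a form of address and adds them as PERSON entities."""
        # Simplified pattern (an assumption): Mr/Mrs/Ms/Dr/Prof, an optional dot,
        # a space, then a capitalised word.
        pattern = re.compile(r"(?:Mr|Mrs|Ms|Dr|Prof)\.?\s[A-Z][a-z]+")
        spans = []
        for match in pattern.finditer(doc.text):
            # char_span translates character offsets into a token-aligned Span.
            span = doc.char_span(match.start(), match.end(), label="PERSON")
            if span is not None:  # None: the match doesn't line up with token boundaries
                spans.append(span)
        # filter_spans keeps the longest span where matches overlap existing entities,
        # so assigning doc.ents does not raise a conflicting-entities error.
        doc.ents = filter_spans(list(doc.ents) + spans)
        return doc

    english = sp.load('en_core_web_sm')
    english.add_pipe("addresses")
    doc = english("Dr Watson and Mrs Hudson were waiting for Mr Holmes.")
    print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "PERSON"])

    Depending on your spaCy version, char_span also accepts an alignment_mode argument ("contract" or "expand") that snaps misaligned character offsets to token boundaries instead of returning None.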