python, nlp, spacy, named-entity-recognition

In spaCy: add a span (doc[a:b]) as an entity in a spaCy doc (Python)


I am running a regex over a whole document to find the spans where it matches:

import spacy
import re

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None 
    # if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

There is also a way to get the span (a sequence of tokens) that covers a regex match even when the boundaries of the match do not line up with token boundaries. See "How can I expand the match to a valid token sequence?" at https://spacy.io/usage/rule-based-matching
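A minimal sketch of that expansion, assuming spaCy v3, where `Doc.char_span` accepts an `alignment_mode` argument. It uses a blank English pipeline instead of `en_core_web_sm` (an assumption made only so that no model download is needed; the tokenizer is all that matters here), and the list names are illustrative:

```python
import re

import spacy

# Assumption: a blank pipeline is enough, since only the tokenizer is used.
nlp = spacy.blank("en")
doc = nlp(
    "The United States of America (USA) are commonly known as "
    "the United States (U.S. or US) or America."
)

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"

strict_spans = []    # None wherever the match cuts through a token
expanded_spans = []  # widened to the tokens that cover the match
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    # Default alignment is strict: offsets must sit on token boundaries.
    strict_spans.append(doc.char_span(start, end))
    # "expand" grows the span to include partially covered tokens.
    expanded_spans.append(doc.char_span(start, end, alignment_mode="expand"))

for match, strict, expanded in zip(
    re.finditer(expression, doc.text), strict_spans, expanded_spans
):
    print(repr(match.group()), "->", strict, "|", expanded)
```

With this regex, the match "US" inside "USA" is the case where strict alignment gives None and the expanded span covers the whole token.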

So far so good.

Now that I have a collection of spans, how do I convert them into entities? I am aware of the EntityRuler, a pipeline component (see also the link above), but it takes patterns to search for in the doc as input, not spans.

If I use a regex over the whole document to get the collection of spans I want to convert into ents, what is the next step? The EntityRuler? If so, how? Or something else?

Put more simply:

nlp = spacy.load("en_core_web_sm")
doc = nlp("The applicable law is article 102 section b sentence 6 that deals with robbery")

I would like to create a spaCy entity (ent) from doc[5:10] with the label "law" so that I can: A) loop over all the law entities in the text, and B) use the visualizer to display the different entities contained in the doc.


Solution

  • The most flexible way to add spans as entities to a doc is to use Doc.set_ents:

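    # start and end are character offsets, e.g. from match.span()
    span = doc.char_span(start, end, label="ENT")
    # char_span returns None when the offsets do not align with token boundaries
    if span is not None:
        doc.set_ents(entities=[span], default="unmodified")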
    

    Use the default option to specify how to set all the other tokens in the doc. By default the other tokens are set to O, but you can use default="unmodified" to leave them untouched, e.g. if you're adding entities incrementally.

    https://spacy.io/api/doc#set_ents
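Applied to the "law" example from the question, the approach could look roughly like the sketch below. It assumes spaCy v3 and a blank English pipeline (so no pretrained model is required). The regex `law_pattern`, the uppercase label "LAW", and the variable names are hypothetical choices for illustration, not part of the original question or answer.

```python
import re

import spacy
from spacy import displacy

# Assumption: a blank English pipeline; with en_core_web_sm the model's own
# entities would also be present and default="unmodified" would keep them.
nlp = spacy.blank("en")
doc = nlp("The applicable law is article 102 section b sentence 6 that deals with robbery")

# Hypothetical pattern for this one sentence; real citations need a broader regex.
law_pattern = r"article \d+ section \w+ sentence \d+"

law_spans = []
for match in re.finditer(law_pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end, label="LAW", alignment_mode="expand")
    if span is not None:
        law_spans.append(span)

# set_ents rejects overlapping spans, so keep only non-overlapping ones.
law_spans = spacy.util.filter_spans(law_spans)
doc.set_ents(law_spans, default="unmodified")

# A) loop over the law entities
for ent in doc.ents:
    if ent.label_ == "LAW":
        print(ent.text, ent.start_char, ent.end_char)

# B) render the entities; jupyter=False makes displacy return the HTML as a string
html = displacy.render(doc, style="ent", jupyter=False)
```

The returned `html` string can be written to a file and opened in a browser, or `displacy.serve(doc, style="ent")` can be used instead to start a local web server.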