I am running a regex over a whole document to find the spans where it matches:
import spacy
import re
nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")
expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None
    # if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)
There is also a way to get the span (a sequence of tokens) for a regex match on the doc even when the boundaries of the match do not line up with token boundaries. See "How can I expand the match to a valid token sequence?" at https://spacy.io/usage/rule-based-matching
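As a rough sketch of that expansion, the snippet below compares the default strict alignment of Doc.char_span with alignment_mode="expand", which snaps a match outwards to the nearest token boundaries. It assumes a tokenizer-only pipeline (spacy.blank("en")) so that no model download is needed; the lists strict and expanded are names introduced here for illustration.

```python
import re

import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for this demo.
nlp = spacy.blank("en")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")
expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"

strict = []    # matches that already line up with token boundaries
expanded = []  # matches widened to cover whole tokens

for match in re.finditer(expression, doc.text):
    start, end = match.span()

    # Default alignment: returns None if the match cuts through a token.
    span = doc.char_span(start, end)
    if span is not None:
        strict.append(span.text)

    # "expand" alignment: widen the match until it covers complete tokens.
    span = doc.char_span(start, end, alignment_mode="expand")
    if span is not None:
        expanded.append(span.text)

print("strict:  ", strict)
print("expanded:", expanded)
```

Here the regex matches "US" inside the token "USA", so the strict lookup drops that match while the expanded lookup returns the whole token.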
So far so good.
Now that I have a collection of spans, how do I convert them into entities? I am aware of the EntityRuler, which is a pipeline component (see the link above), but it takes patterns to search for in the doc as input, not spans.
If I want to run a regex over the whole document and convert the resulting spans into ents, what is the next step? The EntityRuler? How? Or something else?
Put more simply:
nlp = spacy.load("en_core_web_sm")
doc = nlp("The applicable law is article 102 section b sentence 6 that deals with robbery")
I would like to create a spaCy entity (ent) from doc[5:10] with the label "law" so that I can: A) loop over all the law entities in the text, and B) use the visualizer to display the different entities contained in the doc.
The most flexible way to add spans as entities to a doc is to use Doc.set_ents:
from spacy.tokens import Span
span = doc.char_span(start, end, label="ENT")
doc.set_ents(entities=[span], default="unmodified")
Use the default option to specify how to set all the other tokens in the doc. By default the other tokens are set to O, but you can use default="unmodified" to leave them untouched, e.g. if you're adding entities incrementally.
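Putting the pieces together for the "law" example from the question, a minimal sketch might look like the following. It assumes a tokenizer-only pipeline (spacy.blank("en")) to stay self-contained, and law_pattern is a hypothetical regex written just for this sentence; with a trained pipeline such as en_core_web_sm the same calls should apply, with default="unmodified" keeping the entities the model already predicted.

```python
import re

import spacy
from spacy import displacy
from spacy.util import filter_spans

# Assumption: a blank English pipeline (tokenizer only), so doc.ents starts empty.
nlp = spacy.blank("en")
doc = nlp("The applicable law is article 102 section b sentence 6 that deals with robbery")

# Hypothetical pattern for illustration; adapt it to the real citation format.
law_pattern = r"article \d+(?: section \w+)?(?: sentence \d+)?"

spans = []
for match in re.finditer(law_pattern, doc.text):
    start, end = match.span()
    # Attach the entity label while converting character offsets to a Span.
    span = doc.char_span(start, end, label="law")
    if span is not None:  # None if the match does not align with token boundaries
        spans.append(span)

# Doc.set_ents raises an error on overlapping spans, so keep only
# non-overlapping ones (filter_spans prefers the longest span).
spans = filter_spans(spans)
doc.set_ents(entities=spans, default="unmodified")

# A) loop over all the "law" entities
law_ents = [ent for ent in doc.ents if ent.label_ == "law"]
for ent in law_ents:
    print(ent.text, ent.start, ent.end, ent.label_)

# B) render the entities with the visualizer; jupyter=False makes
# displacy.render return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
```

In a Jupyter notebook, calling displacy.render(doc, style="ent") without jupyter=False should display the highlighted text inline instead of returning a string.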