I would like to run some analysis on documents using different spaCy tools, though I am interested in the DependencyMatcher in particular.
It just so happens that for these documents, I already have the character offsets of some difficult-to-parse entities. A somewhat-contrived example:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
    {"offsets": (0, 5), "id": "apple"},
    {"offsets": (41, 54), "id": "san-francisco"},
]
# do something here so that `nlp` knows about those entities
doc = nlp(text)
I've thought about doing something like this:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [{"offsets":(0,5), "id": "apple"}, {"offsets":(41,54), "id": "san-francisco"}]
ruler = nlp.add_pipe("entity_ruler")
patterns = []
for e in already_known_entities:
    start, end = e["offsets"]
    patterns.append({
        "label": "GPE",
        "pattern": text[start:end],
    })
ruler.add_patterns(patterns)
doc = nlp(text)
This technically works, and it's not the worst solution in the world, but I was still wondering if offsets can be added to the nlp object directly. As far as I can tell, the Matcher docs don't show anything like this. I also understand this might be a bit of a departure from typical Matcher behavior, where a pattern is applied to every document in a corpus, whereas here I want to tag entities at certain offsets only for particular documents. Offsets from one document do not apply to other documents.
You are looking for Doc.char_span, which builds a Span from character offsets on a Doc that has already been processed.
from spacy.lang.en import English
nlp = English()
doc = nlp("Blah blah blah")  # char_span is a method of Doc, not of str
span = doc.char_span(0, 4, label="BLAH")
doc.ents = [span]
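Applied to the example from the question, this could look roughly like the sketch below. The "label" values are an assumption added for illustration (the question only supplies offsets and ids), and passing the id as kb_id is just one possible way to keep it attached to the span. With the default strict alignment, char_span returns None when the offsets do not line up with token boundaries, so the sketch skips those cases; alignment_mode="expand" or "contract" would snap to the nearest tokens instead.

```python
from spacy.lang.en import English

nlp = English()
text = "Apple is opening its first big office in San Francisco."

# "label" is a hypothetical extra field; the original data has only offsets and ids.
already_known_entities = [
    {"offsets": (0, 5), "id": "apple", "label": "ORG"},
    {"offsets": (41, 54), "id": "san-francisco", "label": "GPE"},
]

doc = nlp(text)

spans = []
for e in already_known_entities:
    start, end = e["offsets"]
    # Build a Span from character offsets on this particular Doc.
    span = doc.char_span(start, end, label=e["label"], kb_id=e["id"])
    if span is None:
        # Offsets did not align with token boundaries (strict mode); skip them.
        continue
    spans.append(span)

# Entities are set per document, so offsets never leak into other documents.
doc.ents = spans

ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Because the spans are attached to the Doc rather than to the pipeline, nothing needs to be added to nlp itself.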
Note that doc.ents is a tuple, so you can't append to it, but you can convert it to a list, add the new span, and assign the result back to doc.ents.
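A minimal sketch of that tuple-to-list step, assuming the Doc already has one entity and a second one is known by its character offsets (the labels here are illustrative):

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Apple is opening its first big office in San Francisco.")

# Start with a single entity already set on the Doc.
doc.ents = [doc.char_span(0, 5, label="ORG")]

# doc.ents is a tuple, so build a new list and assign it back.
new_ent = doc.char_span(41, 54, label="GPE")
doc.ents = list(doc.ents) + [new_ent]

labels = [(ent.text, ent.label_) for ent in doc.ents]
print(labels)
```

Keep in mind that spans assigned to doc.ents must not overlap, so a new span that collides with an existing entity has to replace it rather than be added alongside it.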