Tags: python, machine-learning, nlp, spacy, dependency-parsing

Add known matches to a spaCy document with character offsets


I would like to run some analysis on documents using different spaCy tools; I am interested in the DependencyMatcher in particular.

It just so happens that for these documents, I already have the character offsets of some difficult-to-parse entities. A somewhat-contrived example:

from spacy.lang.en import English

nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
    {"offsets":(0,5), "id": "apple"}, 
    {"offsets":(41,54), "id": "san-francisco"}
]

# do something here so that `nlp` knows about those entities 

doc = nlp(text)

I've thought about doing something like this:

from spacy.lang.en import English

nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
    {"offsets": (0, 5), "id": "apple"},
    {"offsets": (41, 54), "id": "san-francisco"}
]

ruler = nlp.add_pipe("entity_ruler")
patterns = []
for e in already_known_entities:
    patterns.append({
        "label": "GPE",
        "pattern": text[e["offsets"][0]:e["offsets"][1]]
    })
ruler.add_patterns(patterns)

doc = nlp(text)

This technically works, and it's not the worst solution in the world, but I was still wondering whether offsets can be added to the nlp object directly. As far as I can tell, the Matcher docs don't show anything like this. I also understand this might be a bit of a departure from typical Matcher behavior, where a pattern is applied to every document in a corpus, whereas here I want to tag entities at certain offsets only for particular documents: offsets from one document do not apply to other documents.
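
To make the per-document requirement concrete, here is roughly the shape of the workflow I have in mind, pairing each text with its own offsets via nlp.pipe(..., as_tuples=True). The second document and its offsets are made up purely for illustration, and the commented step is exactly the part I don't know how to do:

from spacy.lang.en import English

nlp = English()

# Each text travels with its own offsets; offsets from one document
# should never apply to another.
texts_with_offsets = [
    ("Apple is opening its first big office in San Francisco.",
     [{"offsets": (0, 5), "id": "apple"}, {"offsets": (41, 54), "id": "san-francisco"}]),
    ("Another document with different offsets entirely.",
     [{"offsets": (8, 16), "id": "document"}]),
]

for doc, known in nlp.pipe(texts_with_offsets, as_tuples=True):
    for e in known:
        start, end = e["offsets"]
        # ... somehow mark (start, end) as an entity on *this* doc only ...
        pass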


Solution

  • You are looking for Doc.char_span.

    from spacy.lang.en import English

    nlp = English()
    doc = nlp("Blah blah blah")  # char_span is a method on Doc, not str
    span = doc.char_span(0, 4, label="BLAH")
    doc.ents = [span]
    

    Note that doc.ents is stored as a tuple, so you can't append to it in place, but you can build a list of spans (e.g. list(doc.ents) + [span]) and assign it back to doc.ents.
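
    For completeness, here is a sketch applying this to the example from the question, assuming spaCy v3. The alignment_mode="expand" argument is my own addition as a safeguard: it snaps offsets that fall inside a token out to the token boundaries instead of char_span returning None, and the GPE label simply mirrors the question's own example.

    from spacy.lang.en import English

    nlp = English()
    text = "Apple is opening its first big office in San Francisco."
    already_known_entities = [
        {"offsets": (0, 5), "id": "apple"},
        {"offsets": (41, 54), "id": "san-francisco"},
    ]

    doc = nlp(text)

    spans = []
    for e in already_known_entities:
        start, end = e["offsets"]
        span = doc.char_span(start, end, label="GPE", alignment_mode="expand")
        if span is not None:
            spans.append(span)

    # Assigning a list replaces the existing (tuple of) entities on the doc.
    doc.ents = spans

    print([(ent.text, ent.label_) for ent in doc.ents])
    # [('Apple', 'GPE'), ('San Francisco', 'GPE')]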