python nlp spacy named-entity-recognition

Longest match only with spaCy PhraseMatcher

I have created a spaCy PhraseMatcher to match names in a document, following the tutorial. I want to use the resulting matches as additional training data for a spaCy NER model. My patterns, however, contain both full names (e.g. 'Barack Obama') and last names on their own ('Obama').

Hence, in a sentence that contains 'Barack Obama', both patterns match, resulting in overlapping matches. This overlap, however, triggers an exception when I try to use the data for training, e.g.:

ValueError: [E103] Trying to set conflicting doc.ents: '(19, 33, 'PERSON')' and '(29, 33, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

I've been considering filtering out overlapping matches before using the data for training, but this seems like a very inefficient approach that would significantly increase processing time on large datasets.

Is there a way to set up a PhraseMatcher so that it returns only the longest match when matches overlap?

Solution

The PhraseMatcher doesn't have a built-in way to filter out overlapping matches while it's matching, but there is a utility function to filter them afterwards: spacy.util.filter_spans(). When spans overlap, it keeps the longest one; if two overlapping spans are the same length, it keeps the one that comes earlier in the text.
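As a rough illustration, here is a minimal sketch of how the matches could be filtered before being assigned to doc.ents. It assumes spaCy v3 (where matcher.add takes a list of pattern Docs) and uses a blank English pipeline; the example sentence, the names list, and the 'PERSON' match key are made up for the example and are not from the question.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: spaCy v3. A blank pipeline is enough, since only the tokenizer is needed.
nlp = spacy.blank("en")

# Hypothetical patterns: a full name and the last name on its own.
names = ["Barack Obama", "Obama"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PERSON", [nlp.make_doc(name) for name in names])

# Hypothetical example sentence.
doc = nlp("Yesterday Barack Obama gave a speech, and Obama smiled.")

# Each match is (match_id, start_token, end_token); turn them into labelled Spans.
# Here 'Barack Obama' and the 'Obama' inside it overlap.
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

# Keep only the longest span of each overlapping group.
filtered = filter_spans(spans)

# Without the filtering step, this assignment would raise E103.
doc.ents = filtered

print([(ent.text, ent.label_) for ent in doc.ents])
# [('Barack Obama', 'PERSON'), ('Obama', 'PERSON')]
```

The second 'Obama' survives because it does not overlap with anything; only the 'Obama' contained in 'Barack Obama' is dropped.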