nlp stanford-nlp spacy feature-extraction named-entity-recognition

spaCy NER: Can the same word be part of two different entities?

For example:

Sentence: The best product in the world is Nestle Cookies.

Entities:

BRAND: Nestle

PRODUCT: Nestle Cookies

Are the above entities valid, or should I tag them as:

Entities:

BRAND: Nestle

PRODUCT: Cookies

And will this choice affect model performance?

Solution

From the documentation:

The entity recognizer is constrained to predict only non-overlapping, non-nested spans. The training data should obey the same constraint. If you like, you could have two sentences with the different annotations in your data. I’m not sure whether this would hurt or help your performance, though.
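As a quick sanity check on training data, the non-overlapping constraint can be tested before training. The following is a minimal sketch in plain Python, not part of spaCy: `find_overlaps` is a hypothetical helper, and the `(start, end, label)` character-offset tuples are an assumed annotation layout.

```python
from itertools import combinations


def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share at least one character."""
    overlaps = []
    for a, b in combinations(sorted(entities), 2):
        # Half-open spans [start, end) overlap when each starts before the other ends.
        if a[0] < b[1] and b[0] < a[1]:
            overlaps.append((a, b))
    return overlaps


text = "The best product in the world is Nestle Cookies."
brand_start = text.index("Nestle")
brand_end = brand_start + len("Nestle")
cookies_start = text.index("Cookies")
cookies_end = cookies_start + len("Cookies")

# Nested annotation: PRODUCT contains the BRAND span.
nested = [(brand_start, brand_end, "BRAND"), (brand_start, cookies_end, "PRODUCT")]
# Flat annotation: the two spans sit side by side.
flat = [(brand_start, brand_end, "BRAND"), (cookies_start, cookies_end, "PRODUCT")]

print(find_overlaps(nested))  # one conflicting pair
print(find_overlaps(flat))    # no conflicts
```

Only the flat annotation passes the check, so only that one is usable for a single entity recognizer.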

If you want spaCy to learn to recover both annotations, you could have two EntityRecognizer instances in the pipeline. You would need to move the entity annotations into an extension attribute, because you don’t want the second entity recogniser to overwrite the entities set by the first one.
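The "move the annotations into an extension attribute" step could look roughly like the sketch below. It assumes spaCy v3; the component name `stash_brand_ents` and the attribute name `brand_ents` are made up for illustration, and the entity is set by hand here to stand in for the output of a trained first recognizer.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical attribute name; force=True lets the snippet be re-run safely.
Doc.set_extension("brand_ents", default=None, force=True)


@Language.component("stash_brand_ents")
def stash_brand_ents(doc):
    # Save the first recognizer's entities, then clear doc.ents
    # so the second recognizer starts from an empty slate.
    doc._.brand_ents = list(doc.ents)
    doc.ents = []
    return doc


nlp = spacy.blank("en")
text = "The best product in the world is Nestle Cookies."
doc = nlp.make_doc(text)

# Stand-in for the first (BRAND) recognizer's prediction.
start = text.index("Nestle")
doc.ents = [doc.char_span(start, start + len("Nestle"), label="BRAND")]

doc = stash_brand_ents(doc)
print([(span.text, span.label_) for span in doc._.brand_ents])
print(doc.ents)
```

In a real pipeline this component would sit between the two recognizers, for example by adding it with `nlp.add_pipe("stash_brand_ents", after=...)` and the name you gave the first recognizer.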

In practice:

If you want a single NER model, the entities must not overlap, so label the sentence as:

BRAND: Nestle

PRODUCT: Cookies

If you train two separate NER models (one for BRAND and one for PRODUCT), each model only ever sees its own label, so the overlapping annotation is fine:

BRAND: Nestle

PRODUCT: Nestle Cookies
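For the two-model route, one way to prepare the data is to keep a single overlapping annotation set and split it by label, so that each recognizer gets its own non-overlapping training examples. This is a sketch in plain Python: `split_by_label` is a hypothetical helper, and the `(text, {"entities": [...]})` layout is assumed because it mirrors the character-offset format commonly used in spaCy training examples.

```python
def split_by_label(text, entities):
    """Group (start, end, label) spans into one training example per label."""
    per_label = {}
    for start, end, label in entities:
        per_label.setdefault(label, []).append((start, end, label))
    # One (text, annotations) pair per label, spans sorted by position.
    return {
        label: (text, {"entities": sorted(spans)})
        for label, spans in per_label.items()
    }


text = "The best product in the world is Nestle Cookies."
brand_start = text.index("Nestle")
brand_end = brand_start + len("Nestle")
product_end = text.index("Cookies") + len("Cookies")

# Overlapping annotation: "Nestle" is both a BRAND and part of the PRODUCT.
annotations = [
    (brand_start, brand_end, "BRAND"),
    (brand_start, product_end, "PRODUCT"),
]

examples = split_by_label(text, annotations)
for label, (sentence, annots) in examples.items():
    spans = [sentence[s:e] for s, e, _ in annots["entities"]]
    print(label, spans)
```

Each resulting example contains spans of a single label only, so it satisfies the non-overlapping constraint for the model it is meant to train.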