For example:
Sentence: The best product in the world is Nestle Cookies.
Entities:
BRAND: Nestle
PRODUCT: Nestle Cookie
Are the above entities valid, or should I tag them as:
Entities:
BRAND: Nestle
PRODUCT: Cookie
And will the choice between these two schemes affect model performance?
From the documentation:
The entity recognizer is constrained to predict only non-overlapping, non-nested spans. The training data should obey the same constraint. If you like, you could have two sentences with the different annotations in your data. I’m not sure whether this would hurt or help your performance, though.
If you want spaCy to learn to recover both annotations, you could have two EntityRecognizer instances in the pipeline. You would need to move the entity annotations into an extension attribute, because you don’t want the second entity recogniser to overwrite the entities set by the first one.
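To make the non-overlapping constraint concrete, here is a minimal sketch in plain Python (no spaCy calls) that checks a set of character-offset annotations for overlaps before training. The `(start, end, label)` tuple layout and the helper name `find_overlaps` are assumptions made for illustration, and the spans use "Cookies" as it appears in the sentence rather than "Cookie".

```python
from typing import List, Tuple

# (start_char, end_char, label); end is exclusive. This layout is an assumption.
Span = Tuple[int, int, str]


def find_overlaps(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every pair of spans that share at least one character."""
    ordered = sorted(spans)
    clashes = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # sorted by start, so no later span can overlap a
            clashes.append((a, b))
    return clashes


text = "The best product in the world is Nestle Cookies."
brand_start = text.index("Nestle")
cookies_start = text.index("Cookies")

brand = (brand_start, brand_start + len("Nestle"), "BRAND")
# First scheme from the question: PRODUCT contains the BRAND span (nested).
product_nested = (brand_start, brand_start + len("Nestle Cookies"), "PRODUCT")
# Second scheme: PRODUCT covers only the word after the brand (no overlap).
product_flat = (cookies_start, cookies_start + len("Cookies"), "PRODUCT")

nested_clashes = find_overlaps([brand, product_nested])
flat_clashes = find_overlaps([brand, product_flat])
```

Under these assumptions, `nested_clashes` contains the BRAND/PRODUCT pair and `flat_clashes` is empty, which matches the constraint quoted above: only the second scheme is usable with a single entity recognizer.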
Consequence:
If you want a single NER tagger, the spans must not overlap, so you must label as follows:
Entities:
BRAND: Nestle
PRODUCT: Cookie
If you want to train two separate NER taggers (one for BRAND and one for PRODUCT), each tagger only sees its own label, so the overlapping annotations are allowed:
Entities:
BRAND: Nestle
PRODUCT: Nestle Cookie
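For the two-tagger option, one way to prepare the data is to split a single overlapping annotation set into one training set per label, so that each set is free of overlaps on its own. The following is only a sketch in plain Python: the `(text, spans)` example layout and the helper name `split_by_label` are assumptions, and wiring each set into its own spaCy component is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]        # (start_char, end_char, label), end exclusive
Example = Tuple[str, List[Span]]   # (sentence text, its annotated spans)


def split_by_label(examples: List[Example]) -> Dict[str, List[Example]]:
    """Build one training set per label, each holding only that label's spans."""
    labels = sorted({label for _, spans in examples for _, _, label in spans})
    per_label: Dict[str, List[Example]] = defaultdict(list)
    for text, spans in examples:
        for label in labels:
            # Keep the sentence even when it has no span for this label,
            # so each tagger also sees negative examples.
            own = [s for s in spans if s[2] == label]
            per_label[label].append((text, own))
    return dict(per_label)


text = "The best product in the world is Nestle Cookies."
start = text.index("Nestle")
brand = (start, start + len("Nestle"), "BRAND")
product = (start, start + len("Nestle Cookies"), "PRODUCT")  # overlaps BRAND

datasets = split_by_label([(text, [brand, product])])
brand_data = datasets["BRAND"]
product_data = datasets["PRODUCT"]
```

Here `brand_data` holds only the BRAND span and `product_data` only the PRODUCT span, so neither set violates the non-overlapping constraint even though the combined annotation does.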