python spacy named-entity-recognition

For spacy's NER, do I need to label the entire word as an entity?


I'm fairly new to spacy and NER. I am dealing with a problem where I want to label many examples of short-form text data. I want to map company names to a custom entity CUSTOM.

Example descriptions:

Amazon1337XS324, Amazon4357YT322, *Google, Just *Eat

I am currently labeling the training data. My question is whether I should label the entire word as an entity or only the company name, e.g. "Amazon1337XS324" or "Amazon", "*Google" or "Google", and "Just *Eat" or "Just Eat".

From this previous post it seems I shouldn't try to remove information that the NER model would find useful. Also, in many labeling tutorials the entire word is always labeled. However, in my use case, the "non-descriptive" subsection of the word could always change, like in the Amazon example, and could end up being noise for the model.

I also don't understand what happens if I only provide the entities "Amazon" or "Google" to spaCy's NER model and new examples come in with many new characters next to them in the same word (e.g. Amazon1337XS325, Amazon1337XS326): will the NER model still be able to identify "Amazon" or "Google" as CUSTOM?


Solution

  • You can't put an NER label on half a token. The tokenizer runs before NER, and the NER component attempts to give a label to each whole token, so if you're only interested in part of a token, the NER component won't be able to figure that out.

    If you don't have some way to separate the tokens in preprocessing, it seems like the only thing you can do is label the whole token. You're right that this will make it harder for the model to learn.

    One alternative is to try training a character-level NER component - basically, split your input into individual characters before training.
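The two options above can be sketched in plain Python, without spaCy. This is only an illustration under stated assumptions: the function names `separate_tokens` and `char_level_tags` and both regular expressions are hypothetical, and the rules are tuned to the four examples in the question, so real data would likely need different ones. The first function inserts spaces so that the company name becomes its own whitespace-delimited token. The second shows what character-level labels could look like, using a BIO scheme with one tag per character.

```python
import re

# Hypothetical rules, tuned only to the examples in the question.
LEADING_LETTERS = re.compile(r"\b([A-Za-z]+)(?=\d)")  # "Amazon1337XS324" -> "Amazon 1337XS324"
ASTERISK = re.compile(r"\*(?=\w)")                    # "*Google" -> "* Google"


def separate_tokens(text):
    """Insert spaces so the company name can become a token of its own."""
    text = ASTERISK.sub("* ", text)
    return LEADING_LETTERS.sub(r"\1 ", text)


def char_level_tags(text, start, end, label="CUSTOM"):
    """Pair each character with a BIO tag; the entity covers text[start:end]."""
    tags = []
    for i in range(len(text)):
        if i == start:
            tags.append("B-" + label)
        elif start < i < end:
            tags.append("I-" + label)
        else:
            tags.append("O")
    return list(zip(text, tags))


examples = ["Amazon1337XS324", "Amazon4357YT322", "*Google", "Just *Eat"]
separated = [separate_tokens(t) for t in examples]

# Character-level view of the first example, with "Amazon" (characters 0-5) as the entity.
tagged = char_level_tags("Amazon1337XS324", 0, 6)

for before, after in zip(examples, separated):
    print(repr(before), "->", repr(after))
```

Note that separating "Just *Eat" this way still leaves the asterisk between "Just" and "Eat", so the entity would have to span all three tokens unless the asterisk is dropped in preprocessing, which is a judgment call about your data.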