I am using spaCy to train my NER model with new entities, and I am using the en_core_web_sm model as my base model because I also want to detect the basic entities (ORG, PERSON, DATE, etc). I ran the en_core_web_sm model over unlabelled sentences and added its annotations to my training set.
With that done, I now want to create the training data for the new entities. For example, I want to add a new entity called FRUIT. I have a bunch of sentences (in addition to those that were annotated using en_core_web_sm earlier) that I am going to annotate. An example sentence is:
"James likes eating apples".
My question is: do I still need to annotate "James" as PERSON as well as annotating "apples" as FRUIT? Or can I skip that because I already have another set of sentences that were annotated with the PERSON entity using the en_core_web_sm model earlier?
Short answer:
Yes, if you want to keep your model precise.
Long answer:
NER is implemented using machine learning algorithms. These classify a token as an entity based on learned distributions and the surrounding tokens.
Therefore, if you provide several samples of annotated text in which a word (token) is not marked as the entity it usually represents, you are showing the model examples where that token counts as a non-entity, and this may reduce your model's precision for that entity type.
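To make this concrete, here is a minimal sketch of what the two annotation choices look like for the example sentence. It assumes the common `(text, {"entities": [(start, end, label)]})` layout with character offsets, which spaCy v2-style training scripts use and which, as far as I know, can be passed to `Example.from_dict` in v3. The helper `make_example` is hypothetical and only illustrates the idea; it is not part of spaCy.

```python
def make_example(text, phrases):
    """Build a (text, {"entities": [...]}) pair from a phrase -> label mapping.

    Hypothetical helper: it labels the first occurrence of each phrase
    using character offsets (start inclusive, end exclusive).
    """
    entities = []
    for phrase, label in phrases.items():
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


sentence = "James likes eating apples"

# Incomplete: "James" is left unlabeled, so this sample presents it
# to the model as a non-entity.
partial = make_example(sentence, {"apples": "FRUIT"})

# Complete: the existing label (PERSON) and the new one (FRUIT) are both marked.
complete = make_example(sentence, {"James": "PERSON", "apples": "FRUIT"})

print(partial)
print(complete)
```

The second form is the one to prefer: every sentence in the training set should carry all the entities you want the final model to predict, not only the new ones.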