Tags: machine-learning, nlp, spacy, named-entity-recognition

Train spaCy NER model with 'en_core_web_sm' as base model


I am using spaCy to train my NER model with new entities, and I am using the en_core_web_sm model as my base model because I also want to detect the basic entities (ORG, PERSON, DATE, etc.). I ran the en_core_web_sm model over unlabelled sentences and added the resulting annotations to my training set.

Having finished that, I now want to create the training data for the new entities. For example, I want to add a new entity called FRUIT. I have a set of sentences (in addition to those that were annotated using en_core_web_sm earlier) that I am going to annotate. An example sentence is:

"James likes eating apples".

My question is: do I still need to annotate "James" as PERSON in addition to annotating "apples" as FRUIT? Or can I skip that, because I already have another set of sentences that were annotated with the PERSON entity using the en_core_web_sm model earlier?


Solution

  • Short answer:

    Yes, if you want to keep your model precise.

    Long answer:

    NER is implemented using machine learning algorithms. These classify a token as an entity based on learned distributions and on the surrounding tokens.

    Therefore, if you provide several samples of annotated text in which a word (token) is not marked as the entity it usually represents, you may hurt your model's precision: those samples tell the model that the token is not an entity in that context. In your example, leaving "James" unlabelled gives the model evidence against tagging it as PERSON.
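
    As an illustration, here is a minimal sketch of the two ways the example sentence could be annotated. It assumes the character-offset training format that spaCy accepts, (text, {"entities": [(start_char, end_char, label), ...]}). The helper `annotate` is hypothetical and written only for this example; it is not part of spaCy.

```python
# Training examples in the character-offset format:
# (text, {"entities": [(start_char, end_char, label), ...]})


def annotate(text, labelled_phrases):
    """Build one training example from (phrase, label) pairs.

    Hypothetical helper for illustration: it labels the first
    occurrence of each phrase in the text.
    """
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        entities.append((start, start + len(phrase), label))
    # Sort by start offset so the spans appear in reading order.
    return (text, {"entities": sorted(entities)})


text = "James likes eating apples"

# Incomplete: "James" is left unlabelled, so this sample presents it
# to the model as a non-entity.
fruit_only = annotate(text, [("apples", "FRUIT")])

# Complete: both the existing PERSON entity and the new FRUIT entity
# are labelled in the same sentence.
complete = annotate(text, [("James", "PERSON"), ("apples", "FRUIT")])

print(fruit_only)
print(complete)
```

    The `complete` form is the one to add to the training set. In spaCy v3, a tuple like this can be turned into a training example with `Example.from_dict(nlp.make_doc(text), annotations)`; in v2, the tuples were passed to `nlp.update` directly.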