python spacy named-entity-recognition

How can I make spaCy recognize all my given entities?


I have a long list of patterns in JSONL format that I loaded and added to the entity ruler:

new_ruler = EntityRuler(nlp).from_disk(project_path + "data/skill_patterns.jsonl")
nlp.add_pipe(new_ruler)

When I print the results with print([(ent.text, ent.label_) for ent in doc.ents]), my output is:

[('data science','SKILL|data-science'), ('CV', 'ORG'), ('Kandidaat', 'FAC'), ('één', 'CARDINAL'), ('LSTM',
 'ORG'), ('Parts', 'GPE'), ('Speech', 'GPE'), ('POS', 'ORG'), ('Entity Recognition', 'ORG'), 
('NER', 'ORG'), ('Word2vec', 'ORG'), ('GloVe', 'ORG'), ('Recursive', 'NORP'), ('Neural Networks', 'ORG'),
 ('Ensemble', 'PERSON'), ('Dynamic', 'NORP'), ('Intent detection', 'PERSON'), ('Phrase matching.-', 'ORG'),
 ('Microsoft', 'NORP'), ('Azure.-', 'ORG'), ('één', 'CARDINAL'), ('Python', 'WORK_OF_ART'),
 ('Pytorch', 'GPE'), ('Django', 'GPE'), ('GoLanguage.-', 'GPE'), ('Kandidaat', 'FAC'), ('1 november 2020', 'DATE')]

I know for a fact that, for example, 'Pytorch' and 'Django' are in my pattern list, so they should be recognized as SKILL rather than GPE. The same goes for quite a few other skills. The relevant patterns are:

{"label":"SKILL|django","pattern":[{"LOWER":"django"}]}
{"label":"SKILL|pytorch","pattern":[{"LOWER":"pytorch"}]}

Does anyone know why the pipeline ignores my custom entities?

Is there a way to prioritize my entities over the ones the model already predicts?

Thanks!


Solution

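  • I've found a solution.

    Adding new_ruler before the NER component (after the parser) in the pipeline gives my custom entities priority. With its default settings the entity ruler does not overwrite entities that are already set, so when it runs after the NER it skips every match that overlaps a span the model has already labeled. When it runs first, the NER component respects the ruler's entities and predicts around them.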

    nlp.add_pipe(new_ruler, after='parser')
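
The ordering effect can be reproduced without a trained model. The sketch below is not the code from the question: it assumes spaCy v3, where components are added by string name, and it uses a second entity ruler (the hypothetical "ner_stand_in") in place of the statistical NER so that nothing has to be downloaded. The example sentence is made up as well.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components (assumes spaCy v3).
nlp = spacy.blank("en")

# Hypothetical stand-in for the statistical NER: it mislabels the skills as GPE,
# like the model output in the question.
ner_stand_in = nlp.add_pipe("entity_ruler", name="ner_stand_in")
ner_stand_in.add_patterns([
    {"label": "GPE", "pattern": [{"LOWER": "django"}]},
    {"label": "GPE", "pattern": [{"LOWER": "pytorch"}]},
])

# The skill ruler is inserted *before* the stand-in, so it sets its entities first.
skill_ruler = nlp.add_pipe("entity_ruler", name="skill_ruler", before="ner_stand_in")
skill_ruler.add_patterns([
    {"label": "SKILL|django", "pattern": [{"LOWER": "django"}]},
    {"label": "SKILL|pytorch", "pattern": [{"LOWER": "pytorch"}]},
])

doc = nlp("Experience with Django and Pytorch")
ents = [(ent.text, ent.label_) for ent in doc.ents]

print(nlp.pipe_names)  # order in which the components run
print(ents)            # labels that survived
```

Because the later component does not overwrite tokens that already carry an entity label, the SKILL labels are kept. If the ruler has to stay at the end of the pipeline, the entity ruler's overwrite_ents setting is the documented alternative for letting its matches replace overlapping model predictions.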