Add custom NER to spaCy 3 pipeline


I am trying to build a custom spaCy pipeline based on the en_core_web_sm pipeline. As far as I can tell, the custom ner has been added correctly, since it appears in the pipe names when they are printed (see below). However, when the combined pipeline is run on text I get no entities, whereas the custom ner pipeline used on its own extracts and labels the correct entities. I am using spaCy 3.0.8 and the en_core_web_sm pipeline 3.0.0.

import spacy


crypto_nlp = spacy.load('model-best')
nlp = spacy.load('en_core_web_sm')

nlp.add_pipe('ner', source=crypto_nlp, name="crypto_ner", before="ner")

print(nlp.pipe_names)

text = 'Ethereum'

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Output: ['tok2vec', 'tagger', 'parser', 'crypto_ner', 'ner', 'attribute_ruler', 'lemmatizer'] (only the pipe names are printed; no entities are found)

But when I use my ner model:

doc = crypto_nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Output: 'Ethereum ETH'


Solution

  • The question doesn't give enough detail to be sure, but my guess is that the ner in crypto_nlp depends on a separate tok2vec component, and that tok2vec is not included when you source only the ner.

    Since this tok2vec won't be shared with the new pipeline, the easiest fix is to modify the ner component so that it includes a standalone copy of the tok2vec. This is called "replacing listeners": https://spacy.io/api/language#replace_listeners

    If crypto_nlp.pipe_names is ['tok2vec', 'ner'], then the following replaces the listener before the component is sourced into the second pipeline, so that it becomes a standalone component:

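    # Give the ner in crypto_nlp its own copy of the tok2vec instead of a listener
    crypto_nlp.replace_listeners("tok2vec", "ner", ["model.tok2vec"])
    # Source the now-standalone ner into the en_core_web_sm pipeline
    nlp.add_pipe('ner', source=crypto_nlp, name="crypto_ner", before="ner")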
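One way to check the guess above before changing anything is to look at the pipeline config for components whose model contains a `spacy.Tok2VecListener` architecture. The sketch below is only an illustration: `find_listener_paths` and `listeners_by_component` are hypothetical helpers (not part of spaCy), and `sample_config` is a hand-written stand-in for what `crypto_nlp.config` might look like, not the asker's real config.

```python
from typing import Any, Dict, List, Mapping

# Architecture name prefix used by spaCy's tok2vec listener layers.
LISTENER_PREFIX = "spacy.Tok2VecListener"


def find_listener_paths(cfg: Mapping[str, Any], prefix: str = "") -> List[str]:
    """Return dotted paths (e.g. 'model.tok2vec') of listener blocks in one component's config."""
    paths: List[str] = []
    for key, value in cfg.items():
        if not isinstance(value, Mapping):
            continue  # only nested config blocks can hold an architecture
        path = f"{prefix}.{key}" if prefix else key
        arch = value.get("@architectures", "")
        if isinstance(arch, str) and arch.startswith(LISTENER_PREFIX):
            paths.append(path)
        else:
            paths.extend(find_listener_paths(value, path))
    return paths


def listeners_by_component(config: Mapping[str, Any]) -> Dict[str, List[str]]:
    """Map each component name to the listener paths found in its config."""
    result: Dict[str, List[str]] = {}
    for name, component_cfg in config.get("components", {}).items():
        paths = find_listener_paths(component_cfg)
        if paths:
            result[name] = paths
    return result


# Hand-written example config (an assumption, for illustration only).
sample_config = {
    "components": {
        "tok2vec": {
            "factory": "tok2vec",
            "model": {"@architectures": "spacy.Tok2Vec.v2"},
        },
        "ner": {
            "factory": "ner",
            "model": {
                "@architectures": "spacy.TransitionBasedParser.v2",
                "tok2vec": {
                    "@architectures": "spacy.Tok2VecListener.v1",
                    "upstream": "*",
                },
            },
        },
    }
}

print(listeners_by_component(sample_config))
```

With a loaded pipeline, the same helper could be called as `listeners_by_component(crypto_nlp.config)`. If `ner` shows up with a path such as `model.tok2vec`, that path is the value to pass in the `listeners` list of `replace_listeners`.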