Tags: spacy, named-entity-recognition, entity-linking, knowledge-base-population

Attaching a custom KB to the spaCy "entity_linker" pipe makes NER calls very poor


I want to run an entity linking job using a custom knowledge base alone, without the second-step ML re-ranker that requires a training dataset / spaCy corpus. I want the NEL pipeline to assign kb_ids based only on the knowledge-base-driven get_candidates() and the prior probabilities from my KB object. However, as soon as I attach my custom KB, the doc.ents calls become very poor, as if the entity_linker pipe had lobotomized the model. Does the entity_linker pipe modify the spans that the ner pipe (which comes before it) produces?

Here are my doc.ents calls when I use the pretrained "en_core_web_lg" model:

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("The NY Times is my favorite newspaper.")
doc.ents
(The NY Times,)

When I run the same example through the nlp object with my custom KB attached to a downstream entity_linker pipe, I get this:

from spacy.kb import KnowledgeBase  # in spaCy >= 3.5: from spacy.kb import InMemoryLookupKB

entity_linker = nlp.add_pipe("entity_linker")

def create_kb(vocab):
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=300)
    kb.from_disk("assets/en/kb")
    return kb

entity_linker.set_kb(create_kb)
nlp.initialize()
doc = nlp("The NY Times is my favorite newspaper.")
print([(ent.text, ent.label_) for ent in doc.ents])  # Output of the NER
print([(ent.text, ent.kb_id_) for ent in doc.ents if ent.kb_id_])  # Output of the entity linker

[('The', 'ORG'), ('Times', 'ORG'), ('is', 'ORG'), ('my', 'ORG'), ('favorite', 'ORG'), ('newspaper', 'ORG'), ('.', 'ORG')]
[('The', 'Q3048768'), ('Times', 'Q11259'), ('is', 'NIL'), ('my', 'NIL'), ('favorite', 'NIL'), ('newspaper', 'NIL'), ('.', 'NIL')]

So the ner pipe is definitely producing different spans after I attach my custom knowledge base. Please educate me on this pipe and whether I can attach my KB object and have it work straight away, without destroying the pretrained model's NER intelligence.
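For reference, here is the kind of KB-driven lookup I mean, as a toy sketch with made-up entity IDs and prior probabilities:

```python
import spacy

# In spaCy >= 3.5 the in-memory KB class is InMemoryLookupKB;
# older v3 releases expose the same API on KnowledgeBase.
try:
    from spacy.kb import InMemoryLookupKB as KB
except ImportError:
    from spacy.kb import KnowledgeBase as KB

nlp = spacy.blank("en")
kb = KB(vocab=nlp.vocab, entity_vector_length=3)

# Toy entities and one alias with hand-set prior probabilities.
kb.add_entity(entity="Q11259", freq=50, entity_vector=[1.0, 0.0, 0.0])
kb.add_entity(entity="Q3048768", freq=5, entity_vector=[0.0, 1.0, 0.0])
kb.add_alias(alias="NY Times", entities=["Q11259", "Q3048768"],
             probabilities=[0.8, 0.1])

# Candidates for a mention string, with the stored priors attached.
for cand in kb.get_alias_candidates("NY Times"):
    print(cand.entity_, cand.prior_prob)
```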


Solution

  • What happens here is that this line

    nlp.initialize()
    

    actually re-initializes all trained components in your pipeline: it sets all weights back to random initializations, effectively producing garbage as you saw from the NER results.

    However, you do need to initialize the new entity_linker component. You can do so by calling initialize on the component directly:

    entity_linker.initialize(...)
    

    The latter requires an additional get_examples argument (https://spacy.io/api/entitylinker#initialize), which can be a small hassle. Alternatively, you can use a context manager to disable the pipes that should be kept as-is when making the call on the nlp object:

    with nlp.select_pipes(disable="ner"):
        nlp.initialize()
    
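If it helps, here is what select_pipes does on a toy pipeline with no trained weights involved (the components are chosen just for illustration):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# Inside the context manager, the disabled pipe is skipped by all
# nlp-level calls (including nlp.initialize); it is restored on exit.
with nlp.select_pipes(disable="sentencizer"):
    print(nlp.disabled)    # ['sentencizer']
    print(nlp.pipe_names)  # ['entity_ruler']

print(nlp.pipe_names)      # ['sentencizer', 'entity_ruler']
```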

    You can read more about this mechanism in the usage docs (https://spacy.io/usage/training#initialization) and in the API docs (https://spacy.io/api/language#initialize).
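If you do want the direct entity_linker.initialize(...) route, here is a minimal sketch with a toy in-memory KB and a single hand-written Example standing in for a real corpus (the entity ID, offsets, and probabilities are made up):

```python
import spacy
from spacy.training import Example

# In spaCy >= 3.5 the in-memory KB class is InMemoryLookupKB;
# older v3 releases use KnowledgeBase with the same constructor.
try:
    from spacy.kb import InMemoryLookupKB as KB
except ImportError:
    from spacy.kb import KnowledgeBase as KB

nlp = spacy.blank("en")
entity_linker = nlp.add_pipe("entity_linker")

def create_kb(vocab):
    kb = KB(vocab=vocab, entity_vector_length=3)
    kb.add_entity(entity="Q11259", freq=50, entity_vector=[1.0, 0.0, 0.0])
    kb.add_alias(alias="The NY Times", entities=["Q11259"], probabilities=[0.9])
    return kb

entity_linker.set_kb(create_kb)

# get_examples must be a callable returning Example objects; one toy
# example with gold entities and links is enough to initialize here.
doc = nlp.make_doc("The NY Times is my favorite newspaper.")
example = Example.from_dict(
    doc,
    {"entities": [(0, 12, "ORG")], "links": {(0, 12): {"Q11259": 1.0}}},
)
entity_linker.initialize(get_examples=lambda: [example], nlp=nlp)
```

With this, only the entity_linker is (re-)initialized, and nothing in the rest of the pipeline is touched.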

    By the way - all of this should happen "automagically" when you use the config system in spaCy v3 and set the "frozen" components correctly, cf. https://spacy.io/usage/training#config-custom.

    You would have something like this:

    [nlp]
    lang = "en"
    pipeline = ["ner","entity_linker"]
    
    [components.ner]
    source = "en_core_web_lg"
    component = "ner"
    
    [components.entity_linker]
    factory = "entity_linker"
    
    [training]
    frozen_components = ["ner"]
    ...
    

    For the config route, you can take inspiration from this example project: https://github.com/explosion/projects/tree/v3/benchmarks/nel

    Hope that resolves things for you!