Tags: python, nlp, spacy, named-entity-recognition

SpaCy: how do you add custom NER labels to a pre-trained model?


I am new to SpaCy and NLP. I am using SpaCy v 3.1 and Python 3.9.7 64-bit.

My objective: to use a pre-trained SpaCy model (en_core_web_sm) and add a set of custom labels to the existing NER labels (GPE, PERSON, MONEY, etc.) so that the model can recognize both the default AND the custom entities.

I've looked at the SpaCy documentation, and what I need seems to be an EntityRecognizer, specifically a new pipe.

However, it is not really clear to me at what point in my workflow I should add this new pipe, since in SpaCy 3 training happens via the CLI, and from the docs it is not even clear to me where the pre-trained model is loaded.

Any tutorials or pointers you might have are highly appreciated.

This is what I think should be done, but I am not sure how:

import spacy
from spacy import displacy
from spacy_langdetect import LanguageDetector
from spacy.language import Language
from spacy.pipeline import EntityRecognizer

# Load model
nlp = spacy.load("en_core_web_sm")

# Register custom component and turn a simple function into a pipeline component
@Language.factory('new-ner')
def create_bespoke_ner(nlp, name):
    
    # Train the new pipeline with custom labels here??
    
    return LanguageDetector()

# Add custom pipe
custom = nlp.add_pipe("new-ner")

This is what my config file looks like so far. I suspect my new pipe needs to go next to "tok2vec" and "ner".

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100
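
From the training docs, it also looks like a component from an existing trained pipeline can be reused by sourcing it in the config, so perhaps the ner block should point at en_core_web_sm rather than a blank factory, something like the snippet below. I am not sure, though, whether that is also where my custom labels would come in:

[components.ner]
source = "en_core_web_sm"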

Solution

  • For Spacy 3.2 I did it this way:

    import spacy
    import random
    from spacy import util
    from spacy.tokens import Doc
    from spacy.training import Example
    from spacy.language import Language
    
    def print_doc_entities(_doc: Doc):
        if _doc.ents:
            for _ent in _doc.ents:
                print(f"     {_ent.text} {_ent.label_}")
        else:
            print("     NONE")
    
    def customizing_pipeline_component(nlp: Language):
        # NOTE: Starting from Spacy 3.0, training via Python API was changed. For information see - https://spacy.io/usage/v3#migrating-training-python
        train_data = [
            ('We need to deliver it to Festy.', [(25, 30, 'DISTRICT')]),
            ('I like red oranges', [])
        ]
    
        # Result before training
        print(f"\nResult BEFORE training:")
        doc = nlp(u'I need a taxi to Festy.')
        print_doc_entities(doc)
    
        # Disable all pipe components except 'ner'
        disabled_pipes = []
        for pipe_name in nlp.pipe_names:
            if pipe_name != 'ner':
                nlp.disable_pipe(pipe_name)  # nlp.disable_pipes() is deprecated in spaCy 3
                disabled_pipes.append(pipe_name)
    
        print("   Training ...")
        optimizer = nlp.create_optimizer()
        for _ in range(25):
            random.shuffle(train_data)
            for raw_text, entity_offsets in train_data:
                doc = nlp.make_doc(raw_text)
                example = Example.from_dict(doc, {"entities": entity_offsets})
                nlp.update([example], sgd=optimizer)
    
        # Enable all previously disabled pipe components
        for pipe_name in disabled_pipes:
            nlp.enable_pipe(pipe_name)
    
        # Result after training
        print(f"Result AFTER training:")
        doc = nlp(u'I need a taxi to Festy.')
        print_doc_entities(doc)
    
    def main():
        nlp = spacy.load('en_core_web_sm')
        customizing_pipeline_component(nlp)
    
    
    if __name__ == '__main__':
        main()
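
  • A couple of follow-up lines I would add around the training code above, assuming you want the custom label registered explicitly before training and the fine-tuned pipeline saved afterwards (the directory name is just an example):

    # Before the training loop: make the custom label explicit on the existing
    # NER component (assumption: this mirrors the usual add_label workflow)
    ner = nlp.get_pipe("ner")
    ner.add_label("DISTRICT")

    # After training: persist the fine-tuned pipeline and load it back later
    nlp.to_disk("custom_ner_model")               # example output directory
    nlp_reloaded = spacy.load("custom_ner_model")
    print(nlp_reloaded.pipe_names)                # default pipes are still present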