
Issues initializing labels in a spaCy spancat pipeline

I am setting up a spaCy spancategorizer with multiple labels. I have annotated my training data using doc.spans[label] for the different labels and saved it as train/dev data. I receive this error when I run the training:

ValueError: [E143] Labels for component 'spancat' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's initialize method.

I have added my labels to the spans_key setting in the [components.spancat] section of the config file.

I have generated an Example object, which contains labels as well, but those labels are not recognized either.

from spacy.training import Example

def get_examples():
    # Example(predicted, reference): the annotated doc must be the reference (gold)
    for gold, pred in zip(all_annotated_pre_shuffled_docs, docs_without_annotations):
        print(pred, gold)
        yield Example(pred, gold)

spancat = nlp.add_pipe("spancat", before="textcat")
spancat.initialize(get_examples, nlp=nlp)

For the Example, there is a parse_gold_doc() function that is not defined in spaCy. Could that be what I am missing? Any guidance would be appreciated. I am new to spaCy, so additional feedback is very welcome. It has been hard to find spancategorizer examples outside Prodigy. Thank you very much.

My doc annotations:

from spacy.tokens import Span

for doc in docs:
    phrase_matches = phrase_matcher(doc)

    # Initialize an empty SpanGroup for each label on each doc
    for label in labels:
        doc.spans[label] = []

    # Label each phrase match and add it to the SpanGroup for its label
    for match_id, start, end in phrase_matches:
        match_label = nlp.vocab.strings[match_id]
        span = Span(doc, start, end, label=match_label)
        doc.spans[match_label].append(span)

To save my data I use the following (only the train and dev sets are used in spaCy):

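import random
from spacy.tokens import DocBin

random.shuffle(docs)
n       = len(docs)
n_train = 2*n//3
n_dev   = max(30, 3*n//4 - 2*n//3)
n_test  = n - n_train - n_dev

# Contiguous, non-overlapping slices so that no doc is skipped
train_docs = docs[:n_train]
dev_docs   = docs[n_train:n_train + n_dev]
test_docs  = docs[n_train + n_dev:]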

# Create and save a collection of training docs
train_docbin = DocBin(docs = train_docs)
train_docbin.to_disk("data/train.spacy")
# Create and save a collection of evaluation docs
dev_docbin = DocBin(docs = dev_docs)
dev_docbin.to_disk("data/dev.spacy")

And my configuration file:

[paths]
train = data/train.spacy
dev = data/dev.spacy
vectors = "en_core_web_lg"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","morphologizer","parser","ner","spancat","textcat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.morphologizer]
factory = "morphologizer"
extend = false
overwrite = true
scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}

[components.morphologizer.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.morphologizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = ['ROOT1', 'ROOT2', 'PERSONAL', 'CHECK_BOX', 'Y/N']
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.tagger]
factory = "tagger"
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.textcat]
factory = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH","SHAPE"]
rows = [5000,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.17
pos_acc = 0.08
morph_acc = 0.08
morph_per_feat = null
dep_uas = 0.08
dep_las = 0.08
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0
ents_f = 0.17
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
spans_sc_f = 0.17
spans_sc_p = 0.0
spans_sc_r = 0.0
cats_score = 0.17
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Solution

The spans_key setting in [components.spancat] should be a string, not a list of labels. You need to save all the spans under a single spans key like "sc", with each span's own label carrying the category. "sc" is the default and the easiest to get working with a default config.
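
As a rough illustration of storing everything under one key, the sketch below adapts the phrase-matching loop from the question so that every labeled span lands in doc.spans["sc"]. The terms dictionary and the sample sentence are made up for the example, and a blank English pipeline stands in for the real one.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; enough for building annotations

# Hypothetical phrase lists per label, standing in for the real ones
terms = {"PERSONAL": ["first name", "last name"], "Y/N": ["yes", "no"]}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for label, phrases in terms.items():
    matcher.add(label, [nlp.make_doc(p) for p in phrases])


def annotate(doc):
    """Store every labeled match under the single key doc.spans["sc"]."""
    spans = []
    for match_id, start, end in matcher(doc):
        label = nlp.vocab.strings[match_id]
        spans.append(Span(doc, start, end, label=label))
    doc.spans["sc"] = spans
    return doc


doc = annotate(nlp("Please enter your first name and tick yes or no below"))
labels = sorted({span.label_ for span in doc.spans["sc"]})
print(labels)
```

The label names stay on the spans themselves, so the config only needs spans_key = "sc".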

It is also better to train the spancat component separately and then add it to an existing pipeline, rather than trying to train from the combined config above. The steps would be:

1. Convert your data with all spans saved under doc.spans["sc"].
2. Create a config with spacy init config config.cfg -p spancat -o accuracy, where config.cfg is the output path (this will use en_core_web_lg vectors).
3. Train the spancat pipeline with spacy train.
4. Add the new spancat component to en_core_web_lg after replacing the spancat tok2vec listener with an internal tok2vec:

import spacy

nlp_spancat = spacy.load("spancat_model")
# Give spancat its own copy of the tok2vec so it no longer depends on the listener
nlp_spancat.replace_listeners("tok2vec", "spancat", ["model.tok2vec"])
nlp_combined = spacy.load("en_core_web_lg")
nlp_combined.add_pipe("spancat", source=nlp_spancat)
nlp_combined.to_disk("combined_model")

You can do the same thing as above with spacy assemble and a config that sources all the components, but the steps above are simpler.
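
Before training, it may be worth checking that the labels are actually picked up at initialization, since that is what E143 complains about. The sketch below is a minimal check on a blank English pipeline, assuming a spaCy version that ships spancat (v3.1 or later); the sentence and the two spans are invented for the example.

```python
import spacy
from spacy.tokens import Span
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")  # default spans_key is "sc"

# Hypothetical gold doc with two labeled spans stored under "sc"
gold = nlp.make_doc("Please enter your first name and tick yes")
gold.spans["sc"] = [
    Span(gold, 3, 5, label="PERSONAL"),  # "first name"
    Span(gold, 7, 8, label="Y/N"),       # "yes"
]


def get_examples():
    # Example(predicted, reference): the reference doc carries the annotations
    yield Example(nlp.make_doc(gold.text), gold)


# Labels are collected from the reference docs' spans under spancat.key
nlp.initialize(get_examples)
print(spancat.key, spancat.labels)
```

If the spans were stored under any other key, no labels would be found for the default "sc" key and initialization would fail with the same E143 error.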