I am trying to predict entities with a custom-trained NER model in spaCy. I read in https://github.com/explosion/spaCy/pull/8855 that confidence scores for each entity can be obtained using spancat, but I am a little confused about how to make that work. As I understand it, we have to train a pipeline that includes the spancat component. The training config file contains this section:
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
Do we have to change this to
[nlp]
lang = "en"
pipeline = ["tok2vec","ner","spancat"]
batch_size = 1000
for spancat to work?
Then, after training, when predicting entities in unseen text, do we have to use
doc = nlp(data_to_be_predicted)
spans = doc.spans["spancat"] # SpanGroup
print(spans.attrs["scores"])  # one score per span in the SpanGroup
to get the confidence scores?
I am using spaCy 3.1.3. According to the documentation, I believe this feature has been released by now.
I'm not really sure there's a question in your post, but yes, spancat is available and you can get entity scores from it.
spancat is a separate component from the ner component. So if you do this:
pipeline = ["tok2vec","ner","spancat"]
spancat will not add scores for the entities your ner component predicted; it only scores the spans it predicts itself. If all you need is scored spans, you probably want to remove the ner component.
About usage, please see the docs and the example project. This is how you get the score:
doc = nlp(text)
span_group = doc.spans["sc"]  # "sc" is the default spans_key; use your own key if you changed it in the config
scores = span_group.attrs["scores"]
# Note that `scores` is an array with one score for each span in the group
for span, score in zip(span_group, scores):
    print(score, span)
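Once you have the spans and their scores, a common next step is to keep only the confident predictions. Here is a minimal sketch of that, assuming the scores line up one-to-one with the spans as in the loop above. The helper name `spans_above_threshold` and the 0.6 cutoff are hypothetical choices for illustration, not part of spaCy, and the example data is a plain-Python stand-in so the snippet runs without a trained model.

```python
from typing import Iterable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def spans_above_threshold(
    spans: Iterable[S],
    scores: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[S, float]]:
    """Return (span, score) pairs with score >= threshold, highest score first.

    Hypothetical helper: it only assumes that `spans` and `scores` have the
    same length and the same order.
    """
    pairs = [
        (span, float(score))
        for span, score in zip(spans, scores)
        if float(score) >= threshold
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Stand-in data (made up for illustration). With a trained pipeline you would
# pass the span group and its attrs["scores"] instead of these two lists.
example_spans = ["Acme Corp", "Berlin", "next week"]
example_scores = [0.91, 0.48, 0.77]

confident = spans_above_threshold(example_spans, example_scores, threshold=0.6)
for span, score in confident:
    print(f"{score:.2f}  {span}")
```

With a real pipeline the call would look like `spans_above_threshold(span_group, span_group.attrs["scores"], threshold=0.6)`. What cutoff makes sense depends on your data, so treat 0.6 as a placeholder to tune on a held-out set.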