python · nlp · huggingface-transformers

What is the best way to compute metrics for the transformers results?


Here is a simple example of a Hugging Face transformers NER pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-large-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is jonathan davis and I live in Chicago, Illinois"

ner_results = nlp(example)
print(ner_results)

    output:

    [{'entity': 'B-PER', 'score': 0.95571744, 'index': 4, 'word': 'j', 'start': 11, 'end': 12},
     {'entity': 'B-PER', 'score': 0.6131773, 'index': 5, 'word': '##ona', 'start': 12, 'end': 15},
     {'entity': 'I-PER', 'score': 0.6707376, 'index': 6, 'word': '##than', 'start': 15, 'end': 19},
     {'entity': 'I-PER', 'score': 0.97754997, 'index': 7, 'word': 'da', 'start': 20, 'end': 22},
     {'entity': 'I-PER', 'score': 0.4608973, 'index': 8, 'word': '##vis', 'start': 22, 'end': 25},
     {'entity': 'B-LOC', 'score': 0.9990302, 'index': 13, 'word': 'Chicago', 'start': 40, 'end': 47}]

For example, I have the gold annotations for my sentence:

jonathan davis - PER
Chicago - LOC
Illinois - LOC (The model did not recognize this entity)

How do I correctly calculate precision and recall, given that the model's output splits words into subword tokens like this:

j, ##ona, ##than

Before this, I used regular expressions together with a metric described in this article, but I do not know whether it is suitable for this task.

Please help me find the correct way to calculate the metric. Perhaps there are built-in features in Hugging Face that I'm missing?


Solution

  • In my opinion, there is something wrong with the model (dslim/bert-large-NER) you're using. According to the documentation, an argument named aggregation_strategy was introduced for exactly this purpose (full explanation).

    But for some reason this is not working properly here. There are two options for a quick fix.

    FIRST: Change the model to one where aggregation works correctly.

    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

    tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
    model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")

    # "simple" merges subword pieces back into whole entities
    nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
    sequence = "My name is jonathan davis and I live in Chicago, Illinois"
    nlp(sequence)
    

    output:

    [{'end': 25,
      'entity_group': 'PER',
      'score': 0.9983611,
      'start': 10,
      'word': 'jonathan davis'},
     {'end': 47,
      'entity_group': 'LOC',
      'score': 0.9982808,
      'start': 39,
      'word': 'Chicago'},
     {'end': 57,
      'entity_group': 'LOC',
      'score': 0.99840826,
      'start': 48,
      'word': 'Illinois'}]
    

    SECOND: Post-process the raw token-level output into a more convenient format yourself (for example, with a small state machine that merges subword pieces back into entities).
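A minimal sketch of that state-machine idea, assuming the raw output format printed in the question (subword continuations marked with `##`, `entity`/`word`/`start`/`end` keys); `merge_tokens` is an illustrative helper, not a transformers API:

```python
# Merge raw token-level pipeline output into whole entities by extending the
# current span while the label matches and the piece is a continuation.
def merge_tokens(ner_results, text):
    """Group B-/I-/##-continuation predictions into (type, surface) pairs."""
    entities, current = [], None                  # current = (label, start, end)
    for tok in ner_results:
        label = tok["entity"].split("-")[-1]      # strip the B-/I- prefix
        continues = (
            current is not None
            and label == current[0]
            and (tok["entity"].startswith("I-") or tok["word"].startswith("##"))
        )
        if continues:
            current = (label, current[1], tok["end"])        # extend the span
        else:
            if current is not None:                          # flush previous span
                entities.append((current[0], text[current[1]:current[2]]))
            current = (label, tok["start"], tok["end"])
    if current is not None:
        entities.append((current[0], text[current[1]:current[2]]))
    return entities

sentence = "My name is jonathan davis and I live in Chicago, Illinois"
raw = [  # the model output from the question (scores and indices omitted)
    {"entity": "B-PER", "word": "j",       "start": 11, "end": 12},
    {"entity": "B-PER", "word": "##ona",   "start": 12, "end": 15},
    {"entity": "I-PER", "word": "##than",  "start": 15, "end": 19},
    {"entity": "I-PER", "word": "da",      "start": 20, "end": 22},
    {"entity": "I-PER", "word": "##vis",   "start": 22, "end": 25},
    {"entity": "B-LOC", "word": "Chicago", "start": 40, "end": 47},
]
print(merge_tokens(raw, sentence))
# → [('PER', 'jonathan davis'), ('LOC', 'Chicago')]
```

Slicing the original string by character offsets recovers the whitespace between "jonathan" and "davis" that the subword pieces lose.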
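As for the metric itself: Hugging Face's own token-classification examples compute entity-level precision and recall with the seqeval library (also available via `evaluate.load("seqeval")`), which scores whole entities rather than subword tokens. Below is a dependency-free sketch of that CoNLL-style entity-level metric; the word-aligned BIO tag sequences are my own encoding of the question's gold and predicted annotations, not output from any library:

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):               # sentinel flushes last span
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = tag[2:], i
        elif tag == "O" and etype is not None:
            spans.append((etype, start, i))
            etype = None
    return spans

def precision_recall(y_true, y_pred):
    """Entity-level scores: a prediction counts only if type AND span match."""
    gold = set(extract_entities(y_true))
    pred = set(extract_entities(y_pred))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# One tag per word of the example sentence (11 words).
# Gold: "jonathan davis" PER, "Chicago" LOC, "Illinois" LOC.
# The model missed "Illinois", so the prediction has "O" there.
y_true = ["O", "O", "O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "B-LOC"]
y_pred = ["O", "O", "O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "O"]

print(precision_recall(y_true, y_pred))  # precision 1.0, recall 2/3
```

Note that under this scheme a partially recognized entity (e.g. only "jonathan" tagged) would count as a miss, which is exactly how the standard CoNLL evaluation treats it.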