Tags: nlp, artificial-intelligence, named-entity-recognition, measurement

Measuring F1-score for NER


I am trying to evaluate an artificial-intelligence model for NER (Named Entity Recognition).
In order to compare it with other benchmarks, I need to calculate the model's F1-score. However, I am unsure how to code this.

My idea was:
True positive: the predicted token matches a true token and the tags are equal, counted as a true positive for that tag.
False negative: the tokens match but the tags differ, or the true token does not appear in the prediction at all, counted as a false negative for that tag.
False positive: the predicted token does not exist in the true pairs but has been assigned a tag. Example:

Phrase: "This is a test"
Predicted: {token: This is, tag: WHO}
True pairs: {token: This, tag: WHO} {token: a test, tag: WHAT}
In this case, {token: This is, tag: WHO} is counted as a false positive for WHO.

The code:

    // val = struct { tokens, tags } holding the true (token, tag) pairs of a phrase;
    // predicted_pairs holds the predicted (token, tag) pairs for the same phrase
    for (auto const &p : predicted_pairs) {
        const std::string &current_token = p.first;   // predicted token
        const std::string &current_tag   = p.second;  // predicted tag
        int tag_id = str2tag_id[current_tag];
        bool current_token_exists = false;            // reset for every predicted token

        for (auto const &j : val.tags) {              // j = (true token, true tag)
            if (j.first == current_token) {
                if (j.second == current_tag) {
                    true_positives[tag_id]++;         // same token, same tag
                } else {
                    false_negatives[tag_id]++;        // same token, different tag
                }
                current_token_exists = true;
            }
        }
        if (!current_token_exists) {
            false_positives[tag_id]++;                // predicted token not among the true pairs
        }
    }

    // any true pair whose token never appeared in the prediction is a false negative for its tag
    for (auto const &i : val.tags) {
        bool found = false;
        for (auto const &j : listed_tokens) {   // listed_tokens = tokens that appear in the prediction
            if (i.first == j) { found = true; break; }
        }
        if (!found) {
            false_negatives[str2tag_id[i.second]]++;
        }
    }

After this, calculate the F1:

    // cast to float so the divisions are not truncated integer divisions
    float precision_total = static_cast<float>(total_true_positives) / (total_true_positives + total_false_positives);
    float recall_total = static_cast<float>(total_true_positives) / (total_true_positives + total_false_negatives);
    float f_1_total = (2 * precision_total * recall_total) / (precision_total + recall_total);
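
Here the totals are simply the sums of the per-tag counters; something like this, assuming the counters are kept in std::map<int, int> keyed by tag id (just one possible layout):

    // micro-averaged totals: sum the per-tag counters
    // (assumes true_positives, false_positives, false_negatives are std::map<int, int>)
    int total_true_positives = 0, total_false_positives = 0, total_false_negatives = 0;
    for (auto const &kv : true_positives)  total_true_positives  += kv.second;
    for (auto const &kv : false_positives) total_false_positives += kv.second;
    for (auto const &kv : false_negatives) total_false_negatives += kv.second;

If a tag never occurs, the denominators above can be zero, so it is probably worth guarding those divisions.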

However, I believe I am getting some concept wrong. Does anyone have an opinion?


Solution

  • This is not a complete answer. Taking a look here, we can see that there are many possible ways of defining an F1 score for NER. That reference considers at least six possible cases, apart from TP, TN, FN, and FP, since a tag can correspond to more than one token, so we may want to count partial matches. There are different ways of defining the F1 score, some of them defining TP as a weighted average of strict and partial matches, for example. CoNLL, which is one of the most famous benchmarks for NER, appears to use a strict definition of precision and recall, which is enough to define the F1 score:

    precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.
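
    To make the strict definition concrete, below is a minimal sketch of an exact-match, entity-level evaluation (the Entity struct, the tags, and the example phrase are illustrative assumptions, not the asker's actual data layout): a prediction counts as a true positive only when both the token span and the tag match a gold entity exactly.

        #include <iostream>
        #include <string>
        #include <vector>

        // One entity: the exact token span plus its tag.
        struct Entity {
            std::string tokens;  // e.g. "This is"
            std::string tag;     // e.g. "WHO"
            bool operator==(const Entity &o) const {
                // strict match: same span AND same tag
                return tokens == o.tokens && tag == o.tag;
            }
        };

        int main() {
            // Gold and predicted entities for one phrase (made-up example data).
            std::vector<Entity> gold      = {{"This", "WHO"}, {"a test", "WHAT"}};
            std::vector<Entity> predicted = {{"This is", "WHO"}, {"a test", "WHAT"}};

            // A prediction is a true positive only if it exactly matches a gold entity.
            int tp = 0;
            for (const auto &p : predicted)
                for (const auto &g : gold)
                    if (p == g) { ++tp; break; }

            int fp = static_cast<int>(predicted.size()) - tp;  // predicted, but not in gold
            int fn = static_cast<int>(gold.size()) - tp;       // in gold, but never predicted

            float precision = tp > 0 ? static_cast<float>(tp) / (tp + fp) : 0.0f;
            float recall    = tp > 0 ? static_cast<float>(tp) / (tp + fn) : 0.0f;
            float f1 = (precision + recall > 0) ? 2 * precision * recall / (precision + recall) : 0.0f;

            // Here: tp=1, fp=1, fn=1, so P = R = F1 = 0.5.
            std::cout << "P=" << precision << " R=" << recall << " F1=" << f1 << "\n";
        }

    With this strict scheme a partially overlapping span such as {token: This is, tag: WHO} earns no credit at all, which is exactly why the partial-match variants mentioned above exist.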