F1-Score and Accuracy for Text Similarity

I am trying to understand how to calculate F1-Score and accuracy between texts while fine-tuning a QA model.

Let's assume we have this:

labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]

preds = ["I am fine", "born in 1995", "Eiffel", "dog"]

In this case, it is clear that the predictions are pretty accurate, but how can I measure the F1-score here? "dogs" and "dog" are not an exact match, but they are very similar.


  • One popular metric for text similarity is the Levenshtein distance or edit distance, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
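    For instance, "dogs" → "dog" is a single deletion, so its normalised similarity is 1 − 1/4 = 0.75. A quick check with a small stdlib-only dynamic-programming routine (a sketch; the code below uses the third-party `python-Levenshtein` package instead):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance (stdlib only)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

distance = edit_distance("dogs", "dog")                    # one deletion -> 1
similarity = 1 - distance / max(len("dogs"), len("dog"))   # 1 - 1/4 = 0.75
```

    Note that 0.75 falls below the default threshold of 0.8 used below, so this pair would be counted as a miss; the threshold choice matters.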

    Try the code below, adjusting the threshold to your needs: a pair counts as a true positive when its normalised similarity reaches the threshold.

    import Levenshtein

    def text_similarity_evaluation(labels, preds, threshold=0.8):
        tp, fp = 0, 0
        for label, pred in zip(labels, preds):
            # Normalised similarity in [0, 1]; 1.0 means the strings are identical
            similarity_score = 1 - Levenshtein.distance(label, pred) / max(len(label), len(pred))
            if similarity_score >= threshold:
                tp += 1
            else:
                fp += 1  # a prediction was made, but it is too far from the label
        fn = len(labels) - tp  # labels without a sufficiently similar prediction
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1_score
    # Example usage
    labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]
    preds = ["I am fine", "born in 1995", "Eiffel", "dog"]
    precision, recall, f1_score = text_similarity_evaluation(labels, preds, threshold=0.8)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1-Score:", f1_score)
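    Since you mention fine-tuning a QA model: the standard SQuAD evaluation measures F1 at the token level rather than the character level, scoring each prediction by the tokens it shares with the gold answer. A minimal sketch of that per-pair score:

```python
from collections import Counter

def token_f1(label, pred):
    # Token-level F1 in the style of the SQuAD evaluation:
    # precision/recall over tokens shared by prediction and gold answer
    gold = label.lower().split()
    guess = pred.lower().split()
    overlap = sum((Counter(gold) & Counter(guess)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]
preds = ["I am fine", "born in 1995", "Eiffel", "dog"]
scores = [token_f1(l, p) for l, p in zip(labels, preds)]
```

    On your example this gives 1.0, 0.75, 0.5 and 0.0: token-level F1 rewards partial answers like "born in 1995", but scores "dog" vs. "dogs" as a complete miss, which is exactly the gap the character-level similarity above is meant to close.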