machine-learning, nlp, model, metrics, evaluation

F1-Score and Accuracy for Text-Similarity

I am trying to understand how to calculate the F1-score and accuracy between texts while fine-tuning a QA model.

Let's assume we have this:

`labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]`

`preds = ["I am fine", "born in 1995", "Eiffel", "dog"]`

In this case, the predictions are clearly quite accurate, but how can I measure the F1-score here? `"dog"` and `"dogs"` are not an exact match, but they are very similar.

Solution

• One popular metric for text similarity is the Levenshtein distance (edit distance), which measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
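To make that concrete, here is a minimal pure-Python sketch of the edit-distance computation (the standard Wagner–Fischer dynamic program; the `edit_distance` helper is my own illustration, not part of any library). For `"dogs"` vs `"dog"` the distance is 1, which normalizes to a similarity of 1 − 1/4 = 0.75:

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic programming, one row at a time:
    after processing a[:i], dp[j] holds the edit distance
    between a[:i] and b[:j]."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i  # prev remembers dp[i-1][j-1]
        for j, cb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # delete ca
                dp[j - 1] + 1,      # insert cb
                prev + (ca != cb),  # substitute (free if chars match)
            )
    return dp[len(b)]

print(edit_distance("dogs", "dog"))  # 1 (delete the trailing "s")
similarity = 1 - edit_distance("dogs", "dog") / max(len("dogs"), len("dog"))
print(similarity)  # 0.75
```

Note that 0.75 falls just below the 0.8 threshold used in the solution below, so `"dog"` vs `"dogs"` would still be counted as a miss at that setting.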

Try the code below, adjusting `threshold` to suit your requirements.

```python
import Levenshtein  # pip install python-Levenshtein

def text_similarity_evaluation(labels, preds, threshold=0.8):
    tp, fp, fn = 0, 0, 0

    for label, pred in zip(labels, preds):
        # Normalized similarity in [0, 1]; 1.0 means an exact match.
        similarity_score = 1 - Levenshtein.distance(label, pred) / max(len(label), len(pred))
        if similarity_score >= threshold:
            tp += 1
        else:
            fp += 1

    # Every label without a sufficiently similar prediction counts as missed;
    # in this one-to-one pairing fn equals fp, so precision equals recall.
    fn = len(labels) - tp

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)

    return precision, recall, f1_score

# Example usage
labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]
preds = ["I am fine", "born in 1995", "Eiffel", "dog"]

precision, recall, f1_score = text_similarity_evaluation(labels, preds, threshold=0.8)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1_score)
```
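For QA fine-tuning specifically, it is worth knowing that benchmarks such as SQuAD compute a token-overlap F1 per example and average it over the dataset, which rewards partial answers like `born in 1995` without needing a character-level threshold. A minimal sketch (whitespace tokenization only; real evaluation scripts also lowercase and strip punctuation and articles before comparing):

```python
from collections import Counter

def token_f1(label: str, pred: str) -> float:
    """Per-example token-overlap F1, as used in extractive QA evaluation."""
    gold, guess = label.split(), pred.split()
    # Multiset intersection: how many tokens the two answers share.
    common = sum((Counter(gold) & Counter(guess)).values())
    if common == 0:
        return 0.0
    precision = common / len(guess)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]
preds = ["I am fine", "born in 1995", "Eiffel", "dog"]

scores = [token_f1(l, p) for l, p in zip(labels, preds)]
print(scores)                     # per-example F1 values
print(sum(scores) / len(scores))  # macro-averaged F1 over the dataset
```

Note that token-level F1 still scores `"dog"` vs `"dogs"` as 0 because the tokens differ, which is exactly the gap the edit-distance approach above is meant to close; combining token F1 with stemming or a character-similarity fallback covers both cases.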