I am trying to understand how to calculate F1-Score and accuracy between texts while fine-tuning a QA model.
Let's assume we have this:
labels = [I am fine, He was born in 1995, The Eiffel tower, dogs]
preds = [I am fine, born in 1995, Eiffel, dog]
In this case, it is clear that the predictions are pretty accurate, but how can I measure the F1-Score here? Dog and dogs are not an exact match, but they are very similar.
One popular metric for text similarity is the Levenshtein distance or edit distance, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
Try the code below, adjusting the threshold to suit your requirements. You will need the `Levenshtein` package (`pip install python-Levenshtein`).
import Levenshtein

def text_similarity_evaluation(labels, preds, threshold=0.8):
    tp, fp = 0, 0
    for label, pred in zip(labels, preds):
        # Normalized similarity: 1.0 for an exact match, 0.0 for completely different strings
        similarity_score = 1 - Levenshtein.distance(label, pred) / max(len(label), len(pred))
        if similarity_score >= threshold:
            tp += 1
        else:
            fp += 1
    # With exactly one prediction per label, every miss is also a false negative
    fn = len(labels) - tp
    if tp == 0:
        return 0.0, 0.0, 0.0  # avoid division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1_score
# Example usage
labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]
preds = ["I am fine", "born in 1995", "Eiffel", "dog"]
precision, recall, f1_score = text_similarity_evaluation(labels, preds, threshold=0.8)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1_score)
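As an alternative worth knowing for QA specifically: the standard SQuAD evaluation computes F1 at the token level, over the overlap between the predicted and gold answer tokens. Below is a simplified sketch of that idea (the official script additionally strips punctuation and articles during normalization, which this version omits):

```python
from collections import Counter

def token_f1(label: str, pred: str) -> float:
    """Simplified SQuAD-style token-overlap F1 for a single answer pair."""
    label_tokens = label.lower().split()
    pred_tokens = pred.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings
    common = Counter(label_tokens) & Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(label_tokens)
    return 2 * precision * recall / (precision + recall)

labels = ["I am fine", "He was born in 1995", "The Eiffel tower", "dogs"]
preds = ["I am fine", "born in 1995", "Eiffel", "dog"]
scores = [token_f1(l, p) for l, p in zip(labels, preds)]
print(sum(scores) / len(scores))  # average F1 over the dataset
```

Note that "dogs" vs "dog" still scores 0 here, because token-level F1 has no notion of partial word matches; that is exactly the kind of case where a character-level similarity such as the Levenshtein approach above is more forgiving.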