I have never understood how to evaluate a tagging task, whether POS tagging or any other sequence tagging task. In particular, I don't know how to calculate precision, recall, and F1 score for these tasks. I found a script named conlleval.perl that can be used directly for evaluation, but I don't know Perl, and I am still confused about how P, R, and F1 are calculated for tagging tasks. Can anyone explain?
There is a simple definition in the book Spoken Language Understanding: Systems for Extracting Semantic Information from Speech (by Gokhan Tur and Renato De Mori), Section 3.1.5, "Evaluation metrics":
Precision = # of reference slots correctly detected by SLU / # of total slots detected by SLU
Recall = # of reference slots correctly detected by SLU / # of total reference slots
F1 = 2 × Precision × Recall / (Precision + Recall)
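To make the definitions above concrete, here is a minimal Python sketch that computes them for one tagged sentence. It rests on assumptions that are not in the book: tags use the BIO scheme (`B-TYPE`, `I-TYPE`, `O`), a slot counts as correctly detected only when its type and both boundaries match the reference exactly, and the helper names `extract_slots` and `precision_recall_f1` are made up for illustration. It is not a port of conlleval.perl and may differ from it on malformed tag sequences.

```python
def extract_slots(tags):
    """Return the set of (type, start, end) slots in a BIO tag sequence.

    Assumption: tags look like "B-PER", "I-PER", or "O". An "I-" tag that
    does not continue the current slot is treated as starting a new one.
    """
    slots = set()
    start, slot_type = None, None
    # A trailing "O" sentinel closes any slot that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and slot_type == tag[2:]
        if slot_type is not None and not continues:
            slots.add((slot_type, start, i))  # end index is exclusive
            start, slot_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and slot_type is None):
            start, slot_type = i, tag[2:]
    return slots


def precision_recall_f1(reference_tags, predicted_tags):
    """Slot-level precision, recall, and F1 for one sentence."""
    reference = extract_slots(reference_tags)
    predicted = extract_slots(predicted_tags)
    correct = len(reference & predicted)  # exact type and boundary match

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = ["B-PER", "I-PER", "O", "B-LOC", "O"]
predicted = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(precision_recall_f1(reference, predicted))  # (0.5, 0.5, 0.5)
```

In this example the system finds two slots, of which only the PER slot matches a reference slot, so precision is 1/2; the reference has two slots, of which one was found, so recall is also 1/2. Note that the score is computed over whole slots, not over individual tokens.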
Note: for the overall metrics, conlleval uses micro-averaging, i.e. the counts are pooled over all slot types before precision and recall are computed.
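As a rough illustration of micro-averaging, the sketch below sums the correct, detected, and reference counts over all slot types and only then applies the formulas. The function name `micro_average` and the per-type counts are hypothetical, chosen just to show the arithmetic.

```python
def micro_average(counts):
    """Micro-averaged precision, recall, and F1.

    counts maps a slot type to a tuple
    (correctly detected, total detected, total reference).
    """
    correct = sum(c for c, _, _ in counts.values())
    detected = sum(d for _, d, _ in counts.values())
    reference = sum(r for _, _, r in counts.values())

    precision = correct / detected if detected else 0.0
    recall = correct / reference if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
counts = {
    "PER": (8, 10, 10),  # 8 correct out of 10 detected, 10 in the reference
    "LOC": (1, 2, 5),    # 1 correct out of 2 detected, 5 in the reference
}
precision, recall, f1 = micro_average(counts)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.75 0.6 0.6667
```

Because the counts are pooled first, frequent slot types weigh more heavily in the overall score than rare ones.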