machine-learning · spacy · named-entity-recognition

What is a good metric to evaluate a NER model trained with spaCy?


I have a dataset of 3000 manually labeled examples, divided into a train and a test set. I trained a NER model using spaCy to extract 8 custom entities such as ACTION, HIRE-DATE, STATUS, etc. To evaluate the model I am using the spaCy Scorer.

There is no accuracy metric in the output, and I am not sure which metric I should use to decide whether the model's performance is good or bad.

There are a couple of cases where precision is low but recall is 100, and F1 is also low, e.g.:

'LOCATION': {'p': 7.142857142857142, 'r': 100.0, 'f': 13.333333333333334},

In the above case, what should our conclusion be?

Following is the full result of the Scorer, where p = precision, r = recall, and f = F1 score. It includes both overall and per-entity performance.

{
'uas': 0.0,
 'las': 0.0,
 'ents_p': 86.40850417615793,
 'ents_r': 97.93459552495698,
 'ents_f': 91.81121419927389,
 'ents_per_type': {'ACTION': {'p': 97.17682020802377,
   'r': 97.61194029850746,
   'f': 97.3938942665674},
  'STATUS': {'p': 83.33333333333334,
   'r': 96.3855421686747,
   'f': 89.3854748603352},
  'PED': {'p': 98.61751152073732,
   'r': 99.53488372093024,
   'f': 99.07407407407408},
  'TERM-DATE': {'p': 83.52272727272727,
   'r': 98.65771812080537,
   'f': 90.46153846153847},
  'LOCATION': {'p': 7.142857142857142, 'r': 100.0, 'f': 13.333333333333334},
  'DOB': {'p': 10.0, 'r': 100.0, 'f': 18.181818181818183},
  'RE-HIRE-DATE': {'p': 34.84848484848485,
   'r': 100.0,
   'f': 51.685393258426956},
  'HIRE-DATE': {'p': 18.96551724137931, 'r': 100.0, 'f': 31.88405797101449},
  'PED-CED': {'p': 100.0, 'r': 71.42857142857143, 'f': 83.33333333333333},
  'CED': {'p': 100.0, 'r': 100.0, 'f': 100.0}},
 'tags_acc': 0.0,
 'token_acc': 100.0}
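For context, every p/r/f triple above is derived from true-positive, false-positive, and false-negative counts. A minimal sketch (the counts below are hypothetical, but they reproduce the LOCATION numbers exactly):

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 (as percentages) from raw counts."""
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical: 1 correct LOCATION prediction, 13 spurious ones, none missed
p, r, f = prf(tp=1, fp=13, fn=0)
print(p, r, f)  # ≈ 7.14, 100.0, 13.33 — the LOCATION row above
```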

Kindly suggest.


Solution

  • It depends on your application. What's worse: missing an entity, or wrongly flagging something as an entity? If failing to label an entity (a false negative) is bad, you care about recall. If wrongly flagging a non-entity as an entity (a false positive) is bad, you care about precision. If you care about precision and recall equally, use F_1. If you care about precision (false positives) twice as much as recall (false negatives), use F_0.5. More generally, F_β weights recall β times as much as precision, so you can pick β to express what you care about. The formula is shown and explained on the Wikipedia page for the F-score.
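    The trade-off described above can be sketched as a small F_β helper (the function name is illustrative; β > 1 favors recall, β < 1 favors precision):

    ```python
    def f_beta(p: float, r: float, beta: float = 1.0) -> float:
        """F_beta score from precision p and recall r (both as percentages)."""
        if p + r == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * p * r / (b2 * p + r)

    # The LOCATION scores from the question: p ≈ 7.14, r = 100
    p, r = 7.142857142857142, 100.0
    print(f_beta(p, r))        # F1   ≈ 13.33 — dominated by the poor precision
    print(f_beta(p, r, 0.5))   # F0.5 ≈ 8.77  — lower, precision weighted more
    print(f_beta(p, r, 2.0))   # F2   ≈ 27.78 — higher, recall weighted more
    ```

    Note how the same precision/recall pair yields very different scores depending on β, which is exactly why the "right" metric depends on your application.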

    Edit: answering the direct question from the original post:

    The system does badly on LOCATION and the three date entities; the others look good. If it were me, I would use NER to extract all dates as a single entity, then build a separate system, rule-based or a classifier, to distinguish between the different kinds of dates. For locations, you could use a system that focuses specifically on geoparsing, such as Mordecai.
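    A minimal sketch of that second stage, assuming the NER model emits a generic DATE entity (all keyword patterns here are illustrative guesses, not a tested rule set):

    ```python
    import re

    # Hypothetical rules: classify a generic DATE entity by keywords
    # found in its surrounding sentence. Order matters: the more
    # specific "re-hire" pattern must be checked before "hire".
    CONTEXT_RULES = [
        (re.compile(r"\b(re-?hired?)\b", re.I), "RE-HIRE-DATE"),
        (re.compile(r"\b(hired?|start(ed)?|joined)\b", re.I), "HIRE-DATE"),
        (re.compile(r"\b(terminat\w+|left|separat\w+)\b", re.I), "TERM-DATE"),
        (re.compile(r"\b(born|birth|dob)\b", re.I), "DOB"),
    ]

    def classify_date(context: str) -> str:
        """Map a DATE entity to a specific label based on nearby words."""
        for pattern, label in CONTEXT_RULES:
            if pattern.search(context):
                return label
        return "DATE"  # fall back to the generic label

    print(classify_date("Employee was hired on 2019-05-01"))  # HIRE-DATE
    print(classify_date("Date of birth: 1985-03-12"))         # DOB
    ```

    This splits the problem so the statistical model only has to find date spans (which it already does with 100 recall), while the easy-to-debug rules handle the fine-grained labels that are currently dragging precision down.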