Tags: spacy, named-entity-recognition

NER - Should I include common prefixes in labeled entities?


I am trying to recognize entities in a set of OCR texts produced from images of documents. Since the documents commonly present information in the form some_label: value, that pattern often (but not always) carries over into the OCR text as well.

My question is: say I am trying to annotate dates in my OCR text files, and 80% of the time the date is in the format Date: xx/xx/xxxx. Would it be better if I ... (see the span sketch after the list)

  1. Only marked xx/xx/xxxx as my date entity
    • Represents the true entity
    • Would be representative of 100% of the data
  2. OR marked the entire Date: xx/xx/xxxx as my date entity
    • Would take advantage of the commonly occurring Date: prefix for better accuracy?
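
For concreteness, here is how the two options would look as character-offset annotations in spaCy's (start, end, label) style. This is only a sketch: the sample text, the offsets, and the DATE label are mine, not from the question.

```python
# Hypothetical OCR line; offsets below are counted by hand.
text = "Date: 03/14/2022"

# Option 1: annotate only the value -> "03/14/2022"
option_1 = {"entities": [(6, 16, "DATE")]}

# Option 2: annotate prefix + value -> "Date: 03/14/2022"
option_2 = {"entities": [(0, 16, "DATE")]}
```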

Another example:

Amounts are commonly represented as $xxxxxx5.37 and $ 63.75, and the same choice applies (sketched after the list):

  1. Choose 5.37 and 63.75
  2. Choose $xxxxxx5.37 and $ 63.75 (taking advantage of $ sign)
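
As spans, the amount example looks the same. Again a sketch, with a made-up line and an assumed AMOUNT label:

```python
text = "Amount: $ 63.75"

# Option 1: value only -> "63.75"
option_1 = {"entities": [(10, 15, "AMOUNT")]}

# Option 2: include the currency sign -> "$ 63.75"
option_2 = {"entities": [(8, 15, "AMOUNT")]}
```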

Which of these would be the better practice to follow / lead to a better model?

(P.S.: I'm using Prodigy to annotate my data)
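
Whichever option you choose, a Prodigy annotation boils down to a JSONL line with the text and a list of labeled spans. A minimal sketch of writing one such line by hand (the file name is an assumption; Prodigy normally produces these records for you):

```python
import json

# One annotated example in Prodigy's span format (here: option 1, value only).
example = {
    "text": "Date: 03/14/2022",
    "spans": [{"start": 6, "end": 16, "label": "DATE"}],
}

# Append to a JSONL file that Prodigy (or a spaCy conversion step) can consume.
with open("annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```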


Solution

  • It depends on the neural network architecture you use.

    Let's assume you use spaCy v2 and its default neural architecture, which is a CNN. In this case, the model slides over your text with a fixed-size window (i.e. it looks at x tokens before and x tokens after the date entity).

    With this approach, every time the token Date: appears in the text, the network is likely to learn that a date entity sits right next to it.

    In this case, my suggestion would be to annotate only the xx/xx/xxxx date as the entity. That gives the model more flexibility in determining what a date entity is. However, testing is always the best way to find out what works best, so give it a try :)
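
    To compare the two schemes empirically, you could train one model per scheme with a minimal spaCy v2 loop like the one below and evaluate both on the same held-out OCR text. The training data here is a toy placeholder; swap in your Prodigy export.

    ```python
    import random
    import spacy

    # Toy examples in spaCy v2's (text, annotations) format -- replace with real data.
    TRAIN_DATA = [
        ("Date: 03/14/2022", {"entities": [(6, 16, "DATE")]}),
        ("Invoice Date: 01/02/2021", {"entities": [(14, 24, "DATE")]}),
    ]

    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("DATE")

    optimizer = nlp.begin_training()
    for epoch in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    ```

    Train once on value-only spans and once on prefix-plus-value spans, then compare entity-level precision and recall on the same held-out set to see which scheme actually wins on your data.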