I am trying to recognize entities in a set of OCR texts from images of documents. Since the text in the documents commonly takes the form some_label: value, that pattern comes up often (but not always) in the OCR text as well.
My question is: say I am trying to annotate dates in my OCR text files, and 80% of the time the date appears in the format Date: xx/xx/xxxx. Would it be better if I annotate just xx/xx/xxxx as my date entity, or Date: xx/xx/xxxx as my date entity, using the Date: prefix for better accuracy?
Similarly, amounts are commonly represented as $xxxxxx5.37 and $ 63.75. Should I annotate 5.37 and 63.75, or $xxxxxx5.37 and $ 63.75 (taking advantage of the $ sign)?
Which of these would be the better practice to follow / lead to a better model?
(P.S.: I'm using Prodigy to annotate my data)
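To make the two options concrete, here is a small sketch (using a made-up OCR line) of the character spans each annotation choice would produce in the (start, end, label) form that Prodigy/spaCy-style annotations store:

```python
# Hypothetical OCR line; not from the actual data set.
text = "Invoice issued. Date: 12/31/2024 Amount due: $ 63.75"

# Option A: annotate only the value as the DATE entity.
value = "12/31/2024"
start_a = text.index(value)
option_a = (start_a, start_a + len(value), "DATE")

# Option B: include the "Date:" prefix inside the entity span.
prefixed = "Date: 12/31/2024"
start_b = text.index(prefixed)
option_b = (start_b, start_b + len(prefixed), "DATE")

print(option_a)  # (22, 32, 'DATE')
print(option_b)  # (16, 32, 'DATE')
```

Either way the offsets point into the same text; the question is only how much of the surrounding pattern goes inside the labelled span.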
It depends on the neural network architecture you use.
Let's assume you use spaCy v2 and its default neural architecture, which is a CNN.
In this case, the architecture slides through your text with a fixed window (i.e. x number of words before the date entity and x number of words after it).
With this approach, every time the token Date: appears in the text, it is likely that the neural network will learn that a date entity sits next to it.
In this case, my suggestion would be to annotate only the xx/xx/xxxx part as the date entity. That gives the model more flexibility in determining what a date entity is, while the Date: prefix still contributes as context inside the window.
However, testing is always the best way to find out what's best. So, give it a try :)
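As a concrete sketch of that suggestion, spaCy v2-style training data (the example texts here are invented) would label only the value span, leaving Date: as unlabelled context the CNN window can learn from:

```python
# spaCy v2 training-data format: (text, {"entities": [(start, end, label)]}).
# Only the xx/xx/xxxx value is annotated; "Date:" stays outside the span.
TRAIN_DATA = [
    ("Date: 03/15/2021 Ref: 8842",
     {"entities": [(6, 16, "DATE")]}),
    ("Received on 07/04/2020 by mail",  # no "Date:" prefix at all
     {"entities": [(12, 22, "DATE")]}),
]

# Sanity-check that every offset pair covers an xx/xx/xxxx string.
for text, ann in TRAIN_DATA:
    for start, end, label in ann["entities"]:
        assert len(text[start:end]) == 10
```

Because the second example has no Date: prefix, annotating values alone also covers the 20% of cases where the prefix is missing from the OCR output.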