After running optical character recognition on some images, I get approximate text, and the recognition is often poor. For instance, the actual text "DATE" comes out as "DHTE" or "0HTE". I need to identify and extract the data in each line, so I don't need perfect recognition, just enough to identify which line is the date line. I tried calculating the Levenshtein edit distance, but unfortunately it tends to give similar values for DATE and TIME. At the moment, I'm exploring whether I can match the data patterns using regular expressions instead.
Is there a method or algorithm that would improve the matching? Fortunately, my set of words is not very large.
(I'm using Tesseract for OCR and Groovy/Java for the algorithm.)
SecondString has a few pretty cool string-matching algorithms: http://secondstring.sourceforge.net/
A more basic option is the Levenshtein distance method in StringUtils (for example, `getLevenshteinDistance` in Apache Commons Lang).
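Since the word set is small, one way to use plain Levenshtein distance is to compare each OCR token against every known keyword and keep the closest one, rejecting it if the distance is too large. Below is a minimal sketch of that idea in plain Java with no external library. The class name `FuzzyWordMatch`, the `closest` helper, the keyword list, and the `maxDistance` threshold of 2 are all assumptions made for illustration, not something taken from SecondString or StringUtils.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class: matches a noisy OCR token against a small keyword list.
class FuzzyWordMatch {

    // Standard dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the keyword nearest to the token, or null if even the nearest one
    // is farther than maxDistance (an assumed, tunable threshold).
    // On a tie, the keyword that appears first in the list wins.
    static String closest(String token, List<String> keywords, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        String normalized = token.toUpperCase();
        for (String keyword : keywords) {
            int d = levenshtein(normalized, keyword);
            if (d < bestDistance) {
                bestDistance = d;
                best = keyword;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("DATE", "TIME");
        for (String token : Arrays.asList("DHTE", "0HTE", "T1ME")) {
            System.out.println(token + " -> " + closest(token, keywords, 2));
        }
    }
}
```

With this scheme "DHTE" is 1 edit from DATE and 3 from TIME, and "0HTE" is 2 from DATE and 3 from TIME, so both resolve to DATE. The margin for "0HTE" is only one edit, though, which is probably why the raw distances look too similar in practice. A weighted variant that charges less for typical OCR confusions (0/D, 1/I and so on) might separate them better, but that is a guess and would need testing on real output.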