
Fuzzy string match


After running optical character recognition (OCR) on some images, I get approximate text, and the recognition is often poor. For instance, the actual text "DATE" comes out as "DHTE" or "0HTE". I need to identify and extract the data in each line, so I don't need perfect recognition, just enough to identify the date line. I tried calculating the Levenshtein edit distance, but unfortunately it tends to give similar values for DATE and TIME. At the moment, I'm exploring whether I can match the data patterns using regular expressions instead.

Is there a method or algorithm that would improve the matching process? Fortunately, my set of words is not very large.

(I'm using Tesseract for OCR and Groovy/Java for the algorithm.)


Solution

  • SecondString has a few pretty cool algorithms: http://secondstring.sourceforge.net/

    A basic option is the Levenshtein distance in StringUtils.
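
    Since the set of expected words is small, one way to use an edit distance is to compare each OCR token against every known label and keep the closest one, rejecting it if it is too far away. The following is only a minimal sketch of that idea in plain Java. It does not use SecondString or StringUtils, the names FuzzyLabelMatcher, closestLabel and maxDistance are made up for illustration, and the threshold of 2 is an assumption that would need tuning on real OCR output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of any library mentioned above.
class FuzzyLabelMatcher {

    // Standard dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the known label closest to the OCR token,
    // or null if even the best label is farther than maxDistance (assumed threshold).
    static String closestLabel(String ocrToken, List<String> labels, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        String token = ocrToken.toUpperCase();
        for (String label : labels) {
            int d = levenshtein(token, label);
            if (d < bestDistance) {
                bestDistance = d;
                best = label;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("DATE", "TIME");
        for (String token : new String[] {"DHTE", "0HTE"}) {
            System.out.println(token + " -> " + closestLabel(token, labels, 2));
        }
    }
}
```

    With the two examples from the question, "DHTE" is at distance 1 from DATE and 3 from TIME, and "0HTE" is at distance 2 from DATE and 3 from TIME, so both resolve to DATE. Comparing against the whole label set and taking the minimum tends to separate labels better than checking a single distance in isolation.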