javascript node.js vb6 ocr modi

Convert OCRed unstructured text into proper text


I am using Microsoft MODI in VB6 to OCR an image. (I know about other OCR tools such as Tesseract, but I find MODI more accurate than the others.)

The image to be OCRed is like this

[image]

and the text that I get after OCR looks like this:

Text1
Text2
Text3
Number1
Number2
Number3

The problem is that the pairing between the two columns is lost: the OCR output lists the whole left column first, then the whole right column. How can I map Number1 to Text1?

I can only think of a solution like this.

MODI provides co-ordinates of all the OCRed words like this

LeftPos = Img.Layout.Words(0).Rects(0).Left
TopPos = Img.Layout.Words(0).Rects(0).Top

So to align words on the same line, we can match the TopPos of each word and then sort the matches by LeftPos, which gives the complete line. I looped through all the words and stored their text, left, and top in a MySQL table, then ran this query:

SELECT group_concat(word ORDER BY `left` SEPARATOR ' ')
FROM test_copy
GROUP BY `top`

My problem is that the Top positions are not exactly the same for each word; there will obviously be a difference of a couple of pixels.

I tried adding DIV 5 to merge words that are within a 5-pixel range, but that doesn't work in some cases. I also tried doing it in Node.js by calculating a tolerance for each word and then sorting by LeftPos, but I still feel this is not the best way to do it.
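To illustrate why fixed 5-pixel buckets can split a line, here is a minimal Node.js sketch. The sample coordinates and the helper names `bucketKey` and `groupByBucket` are made up for illustration and are not taken from the real data. The point is that two words whose Top values differ by one pixel can still land on opposite sides of a bucket boundary.

```javascript
// Hypothetical sample words: Text1 and Number1 sit on the same visual line,
// but their Top values (14 and 15) straddle a multiple of 5.
const words = [
  { text: 'Text1', left: 10, top: 14 },
  { text: 'Number1', left: 200, top: 15 },
  { text: 'Text2', left: 10, top: 41 },
  { text: 'Number2', left: 200, top: 43 },
];

// Same idea as GROUP BY (top DIV 5): integer-divide Top into fixed buckets.
function bucketKey(top, size = 5) {
  return Math.floor(top / size);
}

function groupByBucket(items, size = 5) {
  const lines = new Map();
  for (const w of items) {
    const key = bucketKey(w.top, size);
    if (!lines.has(key)) lines.set(key, []);
    lines.get(key).push(w);
  }
  // Sort each bucket left to right and join the words,
  // like GROUP_CONCAT(word ORDER BY left SEPARATOR ' ').
  return [...lines.values()].map((ws) =>
    ws.sort((a, b) => a.left - b.left).map((w) => w.text).join(' ')
  );
}

const bucketedLines = groupByBucket(words);
console.log(bucketedLines);
```

With this sample, Text2 and Number2 end up together, but Text1 and Number1 fall into different buckets even though they are only one pixel apart.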

Update: The JS code does the job, except for the case where Number1 is off by 5 pixels and Text2 has no counterpart on its line.

Is there a better way to do this?


Solution

  • I'm not 100% sure how you identify the words in your "left" column, but once you have such a word, you can find the other words on its line by projecting the whole rectangle (both top and bottom), not just the Top coordinate, across the page. Then determine the overlap (intersection) with the other words. Note the area marked in red below.

    [image: Horizontal projection]

    This overlap is the tolerance you can use to decide whether something is on the same line. If something overlaps by only a pixel, it is probably from the line above or below. But if it overlaps by, say, 50% or more of the height of Text1, it is likely on the same line.


    Example SQL to find all words in the "line" based on a top and bottom coordinate:

    select 
        word.id, word.Top, word.Left, word.Right, word.Bottom 
    from 
        word
    where 
        -- the word's vertical extent intersects the anchor's, including
        -- the case where the word is taller and fully spans the anchor
        word.Top <= @leftColWordBottom
        and word.Bottom >= @leftColWordTop
    

    Example pseudo-VB6 code to calculate the lines as well:

    'assume words is a collection of WordInfo objects with an Id, Top, 
    '   Left, Bottom, Right properties filled in, and a LineAnchorWordId 
    '   property that has not been set yet.
    
    'get the words in left-to-right order
    wordsLeftToRight = SortLeftToRight(words) 
    
    'also get the words in top-to-bottom order
    wordsTopToBottom = SortTopToBottom(words) 
    
    'pass through identifying a line "anchor", that being the left-most 
    '   word that starts (and defines) a line
    for each anchorWord in wordsLeftToRight
    
        'check if the word has been mapped to a line yet by checking if 
        '   its anchor property has been set yet.  This assumes 0 is not 
        '   a valid id, use -1 instead if needed
        if anchorWord.LineAnchorWordId = 0 then 
    
            'now locate every word on this line, as bounded by the 
            '   anchorWord.  every word determined to be on this line 
            '   gets its LineAnchorWordId property set to the Id of the 
            '   anchorWord
            for each lineWord in wordsTopToBottom
    
                If lineWord.Bottom < anchorWord.Top Then
    
                    'skip it, it is above the line (but keep searching down
                    '   because we haven't reached the anchorWord location yet)
    
                ElseIf lineWord.Top > anchorWord.Bottom Then
    
                    'skip it, it is below the line, and exit the search 
                    '   early since all the rest will also be below the line
                    Exit For
    
                ElseIf OverlapsWithinTolerance(anchorWord, lineWord) Then
    
                    lineWord.LineAnchorWordId = anchorWord.Id
    
                End If
    
            next
    
        end if
    
    next anchorWord
    
    'at this point, every word has been assigned a LineAnchorWordId, 
    '   and every word on the same line will have a matching LineAnchorWordId
    '   value.  If stored in a DB you can now group them by LineAnchorWordId 
    ' and sort them by their Left coord to get your output.
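Since the question also mentions Node.js, the same idea can be sketched in JavaScript. This is only an illustration under stated assumptions: the word objects (`text`, `left`, `top`, `bottom`), the helper names `overlapRatio` and `groupIntoLines`, and the sample coordinates are all hypothetical. Unlike the pseudocode above, it skips words that already belong to a line and does not use the top-to-bottom early exit.

```javascript
// Fraction of the anchor's height covered by the other word's vertical extent.
function overlapRatio(anchor, other) {
  const overlap =
    Math.min(anchor.bottom, other.bottom) - Math.max(anchor.top, other.top);
  const height = anchor.bottom - anchor.top;
  return height > 0 ? Math.max(0, overlap) / height : 0;
}

// Group words into lines. minRatio = 0.5 mirrors the "50% or more" rule of thumb.
function groupIntoLines(words, minRatio = 0.5) {
  // Left-most unassigned word becomes the anchor that defines a line.
  const byLeft = [...words].sort((a, b) => a.left - b.left);
  const anchorOf = new Map(); // word object -> its anchor word object

  for (const anchor of byLeft) {
    if (anchorOf.has(anchor)) continue; // already placed on an earlier line
    anchorOf.set(anchor, anchor);
    for (const w of byLeft) {
      if (!anchorOf.has(w) && overlapRatio(anchor, w) >= minRatio) {
        anchorOf.set(w, anchor);
      }
    }
  }

  // Collect words per anchor; byLeft order keeps each line sorted left to right.
  const lines = new Map();
  for (const w of byLeft) {
    const anchor = anchorOf.get(w);
    if (!lines.has(anchor)) lines.set(anchor, []);
    lines.get(anchor).push(w);
  }

  // Order the lines top to bottom by their anchor and join the words.
  return [...lines.entries()]
    .sort(([a], [b]) => a.top - b.top)
    .map(([, ws]) => ws.map((w) => w.text).join(' '));
}

// Hypothetical data: Number1 is 5 px lower than Text1, Text2 has no number.
const sampleWords = [
  { text: 'Text1', left: 10, top: 100, bottom: 120 },
  { text: 'Text2', left: 10, top: 130, bottom: 150 },
  { text: 'Text3', left: 10, top: 160, bottom: 180 },
  { text: 'Number1', left: 200, top: 105, bottom: 125 },
  { text: 'Number3', left: 200, top: 158, bottom: 178 },
];

const groupedLines = groupIntoLines(sampleWords);
console.log(groupedLines);
```

With this sample data, Number1 covers 75% of Text1's height and does not touch Text2 at all, so it stays with Text1 and Text2 ends up on a line of its own. The 0.5 threshold is a starting point and probably needs tuning for real scans.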