We have a document in the form of a collection of hundreds of scanned images. After running OCR, we have saved the bounding boxes at the character, word, and line levels, with coordinates (essentially we get the character boxes, combine them into word bounding boxes, and then combine those into line bounding boxes).
Our problem: the user may search for text that spans characters, words, or lines. We need to show search results where the query phrase, if present, is highlighted on the right page.
For a given page image, we have all the characters and words, plus the bounding boxes for every character, word, and line. Now suppose the user searches for "was a really". The question is: what kind of data structure and algorithm can we use at runtime to show the search results quickly, with the entire search text highlighted?
Note: the phrase may start or end mid-word, like "his was a rea", or span lines, like "deal. How abou" (let's suppose the next line starts with "How about a new one").
Should we store only the character-level bounding boxes and construct the other two levels from those? Or will storing all three levels help with performance overall, even if it means using more memory at runtime to load the saved data from all three levels for the search?
You say you have to "highlight the search". Given the requirements, you have to store the character-level bounding boxes, because those are what you need for correct highlighting at the character level.
The searches you describe are simple linear text searches, where the text being searched is essentially one long string. You could use something like the Boyer-Moore or the Knuth-Morris-Pratt algorithm to quickly get the position where the searched text occurs (if at all). If space is not too great a problem, you could also look into N-gram indexing.
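Just to make the search side concrete, here is a minimal C sketch over the flat document text. strstr() stands in for the linear scan; a Boyer-Moore or KMP routine would be a drop-in replacement. The sample text and function name are mine, not anything you have to use.

```c
#include <stdio.h>
#include <string.h>

/* Print the character offset of every occurrence of `query` in `text`.
 * strstr() is a plain linear scan; a Boyer-Moore or KMP implementation
 * can replace it for large documents without changing anything else. */
static void find_all(const char *text, const char *query)
{
    const char *p = text;
    while ((p = strstr(p, query)) != NULL) {
        printf("match at character offset %ld\n", (long)(p - text));
        p++;   /* step forward by one to also catch overlapping matches */
    }
}

int main(void)
{
    const char *text = "There was a really good deal. How about a new one";
    find_all(text, "was a really");   /* prints: match at character offset 6 */
    return 0;
}
```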
If I misunderstood and you do not have the linear text, build it once, up front. Searching within a cloud of single letters is simply not worth the expense. This means you'll have to manage small misalignments and add spaces as necessary; I did this frequently when working with PDFs. You can find some algorithm snippets and insights here, or feel free to ask more questions.
Assuming you have the search down pat, you now need to retrieve the bounding boxes quickly to perform the highlighting.
You can, for example, store the bounding boxes as quadruplets of coordinates; possibly you can play with the data sizes to save space (for the x and y coordinates 16 bits are almost certainly enough, and for the w and h sizes eight bits are plenty; so three 16-bit words, 48 bits in all, can easily contain your bounding box).
The simplicity of this setup means that once you have a match for, say, a 22-character string at position 13,984,283 in the text file, you can just multiply that offset by six to get the offset in the bounding-box database. Read 22*6 bytes from there, and you have the boxes for all 22 characters in next to no time.
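To pin the layout down, here is a sketch of one possible record and of how the database could be appended to when you index the OCR output: one fixed 6-byte record per character, in the same order as the flat text, so record i always describes character i. The struct name and field order are my own choice.

```c
#include <stdio.h>
#include <stdint.h>

/* One bounding box per character: two 16-bit coordinates, two 8-bit sizes.
 * On common ABIs sizeof(CharBox) is exactly 6; add a packing pragma if
 * your compiler pads it. */
typedef struct {
    uint16_t y, x;   /* top-left corner */
    uint8_t  w, h;   /* width and height */
} CharBox;

/* Append the boxes of one OCR'd page to the database, in text order, so
 * that the box of global character i sits at byte offset i * sizeof(CharBox). */
static size_t append_boxes(FILE *db, const CharBox *boxes, size_t count)
{
    return fwrite(boxes, sizeof(CharBox), count, db);
}
```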
If the above is already fast enough, read no further. Possible enhancements include storing an index of known words, but since you want to be able to match partial words, chances are the average case isn't worth it. Optimizing the Boyer-Moore search itself to speed things up is another possibility.
To save space, you can shrink the records themselves and/or use compression.
By accepting a slight loss of precision, you could use 11 bits each for y and x and 5 bits each for w and h, and store a complete bounding box in 32 bits.
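A sketch of that packing, assuming the values already roughly fit the reduced widths (out-of-range values are simply clamped here; in practice you might scale them instead, which is the "slight error"):

```c
#include <stdint.h>

/* Pack a box into 32 bits: 11 bits for y, 11 for x, 5 for w, 5 for h. */
static uint32_t pack_box(unsigned y, unsigned x, unsigned w, unsigned h)
{
    if (y > 0x7FF) y = 0x7FF;   /* clamp to 11 bits */
    if (x > 0x7FF) x = 0x7FF;
    if (w > 0x1F)  w = 0x1F;    /* clamp to 5 bits */
    if (h > 0x1F)  h = 0x1F;
    return (uint32_t)y << 21 | (uint32_t)x << 10 | (uint32_t)w << 5 | (uint32_t)h;
}

/* Undo pack_box by masking out each field again. */
static void unpack_box(uint32_t b, unsigned *y, unsigned *x, unsigned *w, unsigned *h)
{
    *y = (b >> 21) & 0x7FF;
    *x = (b >> 10) & 0x7FF;
    *w = (b >> 5)  & 0x1F;
    *h =  b        & 0x1F;
}
```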
You could also compress a "page" of bounding boxes using a delta algorithm, so that in most cases you store small values that fit in fewer bytes.
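One possible shape for that, as a sketch only: keep the records fixed-width but store each field as the difference from the previous box; the resulting stream of small values is then what a variable-length encoder or a general-purpose compressor shrinks. Unsigned wrap-around makes the round trip lossless.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint16_t y, x; uint8_t w, h; } CharBox;

/* Replace every field with its difference from the previous record.
 * Processed back to front so each predecessor is still the raw value. */
static void delta_encode(CharBox *b, size_t n)
{
    for (size_t i = n; i-- > 1; ) {
        b[i].y -= b[i-1].y;  b[i].x -= b[i-1].x;
        b[i].w -= b[i-1].w;  b[i].h -= b[i-1].h;
    }
}

/* Undo delta_encode by accumulating the differences front to back. */
static void delta_decode(CharBox *b, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        b[i].y += b[i-1].y;  b[i].x += b[i-1].x;
        b[i].w += b[i-1].w;  b[i].h += b[i-1].h;
    }
}
```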
Suppose there are 300 scanned images comprising the document, with pages ranging from 1 to 300. Each page yields a chunk of text; combined, they give the whole text of the document. Each page has bounding boxes for every line, word, and character in it. When somebody searches for "hello world" and there are eleven matches in the whole document text, each with start and end indices, how do we map them to page numbers and bounding boxes? That is, how do we map the match indices to their box coordinates?
It goes like this:
You build the page index once and for all, when you do the OCR: you know that page 0 yields 31829 characters, page 1 yields 32118, and so on, so you tally characters 0-31828 to page 0, 31829-63946 to page 1, etc. When you then request character 172035, by examining the list
131283 162112 page17.jpg
162113 184773 page18.jpg
you know that for the first match, you have to display page18.jpg.
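A minimal sketch of that lookup, assuming the per-page character counts were summed into a table right after OCR (the table below is truncated and, apart from the first entries quoted above, the values are only illustrative):

```c
#include <stdio.h>

/* first_char[p] = global offset of the first character of page p; the last
 * entry is the total character count and acts as a sentinel. Built once,
 * right after OCR. (Truncated, illustrative values.) */
static const long first_char[] = { 0, 31829, 63947, 96010, 127764, 160111 };
static const int  npages = (int)(sizeof first_char / sizeof first_char[0]) - 1;

/* Binary search: return the page whose character range contains `offset`. */
static int page_of(long offset)
{
    int lo = 0, hi = npages - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (first_char[mid] <= offset) lo = mid;
        else                           hi = mid - 1;
    }
    return lo;
}

int main(void)
{
    printf("character 64000 is on page %d\n", page_of(64000));  /* page 2 */
    return 0;
}
```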
All the bounding boxes are stored one after the other in the box file, as structures of two int16's and two int8's. Each structure occupies 6 bytes, so the bounding box for character 172035 starts at offset 172035*6 from the beginning of the file, and there we find
offset y x w h
1032210 812 238 12 19
1032216 812 251 11 18
1032222 836 12 13 19
... so we load page18.jpg and start highlighting rectangles: one of 12x19 pixels at y=812, x=238, the next of 11x18 at y=812, x=251, and so on.
The time required is next to nothing: one fseek() on the box file; the matched text is, say, 18 characters long, so you need to read 18 bounding boxes, and that is just 108 bytes.
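A sketch of that step, reusing the 6-byte record from above; the file name is illustrative and error handling is kept minimal:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint16_t y, x; uint8_t w, h; } CharBox;   /* 6 bytes */

/* Fetch the boxes for a match of `len` characters starting at global
 * character offset `pos`: one seek to pos * 6, one read of len * 6 bytes.
 * (For box files over 2 GB, use fseeko/_fseeki64 where long is 32-bit.) */
static CharBox *boxes_for_match(const char *boxfile, long pos, size_t len)
{
    FILE *f = fopen(boxfile, "rb");
    if (!f) return NULL;

    CharBox *out = malloc(len * sizeof(CharBox));
    fseek(f, pos * (long)sizeof(CharBox), SEEK_SET);
    size_t got = fread(out, sizeof(CharBox), len, f);
    fclose(f);

    if (got != len) { free(out); return NULL; }
    return out;   /* caller draws one w*h rectangle at (x, y) per entry */
}
```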
In some scenarios you might not want to display page18.jpg, or even store it at all, but rather rebuild it (possibly client side) from the text and the bounding boxes, even if that is more time-consuming. In our example, at about 30K characters per page, a page is described by about 30K boxes (four numbers each) plus 30K characters of text. Even transmitted as JSON that's about 600K, and with the transparent HTTP compression modern browsers and servers negotiate, it goes down to about 100K on the wire, or even less:
{"text":"There was Eru, the One...", "bbox":[{12,17,11,19},{...}]}
To save space, you could divide, index, and compress the bounding-box database. Chances are that plain gzip compression will get you an 8x reduction in size on average. So, if you have a one-gigabyte text file, you have a 6 GB binary file with the bounding boxes. Divide it into 16 KB chunks; each chunk will compress to a 2-3 KB block. Write the blocks to a compressed file, and save each block's 64-bit offset into an index file. You'll end up with some 400 thousand blocks, about 800 MB of space, and a 3 MB index.
Say you need the bytes at offset 3,373,256,576: divide that by the 16 KB block size, and you find they're in compressed block 205887, at offset 3968 within it (sometimes you'll need two adjacent blocks). Read the 8 bytes at offset 205887*8 in the index file to get the address of the compressed block in the large binary file. Start decompressing there until you have 16K worth of data, seek to offset 3968, and you're there. That took one read from the index file and one 2-3 KB read from the binary file, and the whole thing occupies less than 2 GB instead of 7.
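A sketch of the lookup side, under these assumptions: the box database was split into 16 KB chunks, each chunk compressed with zlib's compress() and written back to back into a data file, and the 64-bit start offset of every chunk (native byte order) appended to an index file. All names are illustrative; link with -lz.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <zlib.h>

#define CHUNK 16384   /* uncompressed chunk size, 16 KB */

/* Decompress the chunk containing absolute byte offset `pos` of the
 * uncompressed box database into `out` (CHUNK bytes) and return a pointer
 * to the requested position inside it. A match that crosses a chunk
 * boundary needs the next chunk as well; that case is not handled here. */
static unsigned char *read_at(FILE *idx, FILE *blob, uint64_t pos,
                              unsigned char *out)
{
    uint64_t block = pos / CHUNK;
    uint64_t start, end;

    /* Two consecutive index entries give the compressed chunk's extent. */
    fseek(idx, (long)(block * 8), SEEK_SET);
    if (fread(&start, sizeof start, 1, idx) != 1) return NULL;
    if (fread(&end, sizeof end, 1, idx) != 1) {   /* last chunk: to end of file */
        fseek(blob, 0, SEEK_END);
        end = (uint64_t)ftell(blob);
    }

    unsigned char *cbuf = malloc((size_t)(end - start));
    if (!cbuf) return NULL;
    fseek(blob, (long)start, SEEK_SET);
    fread(cbuf, 1, (size_t)(end - start), blob);

    uLongf outlen = CHUNK;
    int rc = uncompress(out, &outlen, cbuf, (uLong)(end - start));
    free(cbuf);
    if (rc != Z_OK) return NULL;

    return out + pos % CHUNK;    /* e.g. offset 3968 inside block 205887 */
}
```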