Tags: image-processing, imagemagick, photoshop, photoshop-cs4

How to recognize a text-presence pattern in a scanned image and crop it?


Smart Cropping for Scanned Docs

I recently took over a preservation project for old books and manuscripts. There are a lot of them, almost 10,000 pages. I had to scan them manually with a portable scanner because they were not in good enough condition for an automated book scanner.

The real problem shows up when I start editing them in Photoshop. All of them are plain documents (in JPG format) with absolutely no pictures in them. They are written in a different language (Oriya), for which I am fairly sure no OCR software will be available in the near future. (If there is, please let me know.)

To make those images (docs) look clean and elegant I have to crop them, position them, increase the contrast a bit, clean up stray spots with the eraser, et cetera. I was able to automate most of these steps in Photoshop, but cropping is where I am stuck. I can't automate cropping because the software can't recognize the presence of text or content in a given area of the image (doc); it just applies whatever crop values it is given.

I want a way to automate this cropping process. I have an idea for it, but I don't know if it's practical enough to implement, and as far as I know there is no software on the market that does this kind of thing.

A possible solution: a tool could detect where text is present in the image and crop right at the border of the text on each side, producing a document image with no margin. Detection does not have to be sophisticated, since these are all ordinary document pages with no pictures or patterns, just plain rectangles of text. After that, the remaining tasks can be automated in Photoshop, such as adding white space for margins and adjusting contrast and color to make the pages more readable.
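
To make the idea concrete, here is a minimal sketch of one way it could work: count dark pixels per row and per column, and crop to the first and last rows and columns that contain enough of them. It assumes the page has already been loaded as a grid of grayscale values (reading the JPG needs an imaging library, which is left out here). The names content_bbox, crop and add_margin, the two thresholds and the tiny synthetic page are all made up for illustration and do not come from any existing tool.

```python
def content_bbox(gray, dark_threshold=128, min_dark_fraction=0.02):
    """Return (left, top, right, bottom) of the text area, or None if blank.

    gray is a list of rows, each a list of 0-255 values (0 = black).
    A row or column counts as text only if at least min_dark_fraction of
    its pixels are darker than dark_threshold, so isolated specks are
    ignored. right and bottom are exclusive.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    if width == 0:
        return None

    # Projection profiles: number of dark pixels in each row and column.
    row_counts = [sum(1 for v in row if v < dark_threshold) for row in gray]
    col_counts = [sum(1 for row in gray if row[x] < dark_threshold)
                  for x in range(width)]

    rows = [y for y, c in enumerate(row_counts) if c >= min_dark_fraction * width]
    cols = [x for x, c in enumerate(col_counts) if c >= min_dark_fraction * height]
    if not rows or not cols:
        return None
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def crop(gray, box):
    """Cut the grid down to the given (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return [row[left:right] for row in gray[top:bottom]]


def add_margin(gray, margin, fill=255):
    """Surround the grid with a uniform border of `fill` pixels."""
    width = len(gray[0]) + 2 * margin
    padded = [[fill] * width for _ in range(margin)]
    padded += [[fill] * margin + list(row) + [fill] * margin for row in gray]
    padded += [[fill] * width for _ in range(margin)]
    return padded


# Synthetic 30x20 "page": white background, one dark text block, one speck.
page = [[255] * 30 for _ in range(20)]
for y in range(5, 15):
    for x in range(8, 22):
        page[y][x] = 0
page[1][1] = 0  # isolated speck that should not affect the crop

box = content_bbox(page, min_dark_fraction=0.1)
print(box)  # (8, 5, 22, 15)
clean = add_margin(crop(page, box), margin=2)
```

On real scans the thresholds would need tuning, and dark scanner edges or page shadows would probably have to be masked out first, since they would otherwise count as content.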

Here is a link to a gallery of sample images. I can post more samples if that would be useful; just let me know.

http://imageshack.us/g/1/9800204/

Here is one example from the larger set of images available through the link above:

[Image: one example of a bigger set]


Solution

  • We addressed many "smart cropping" issues in our open-source DjVu->PDF converter. The converter can also load a set of scanned images instead of a DjVu file (hold SHIFT while using the Open command) and output the resulting set of images instead of a PDF.

    It is a free cross-platform GUI tool, written in Java.

    [Image: image converter, smart crop and deskew]