pdfocr

Definite OCR Quality


I have a low-quality English PDF file with no images or tables, a single column, and pure black and white with no gray at all. I ran it through ABBYY FineReader, which detected the text just fine, so I can now search inside it.

But I need true, vector-like quality so that I can print it. What software should I use?


Solution

  • Turning images back into text is a tough programming challenge. First you need to make sure the image is fit for OCR, which can involve several graphics libraries to correct a skewed or poor-contrast image.

    After that, ANY OCR engine will generally do, but they are "word based", so they need a defined language dictionary to convert inked shapes into letters and then those letters into words. Note how each recognised word is placed independently and at the wrong scale.

    Next you need to edit the letter shapes into a consistent position and line height, as in the middle image. This can be very labour intensive.

    This is where you choose fonts, colour, scaling and style, as in a word processor.

    Finally, deleting the image is a single one-line command.

    [image]

    ABBYY is definitely at the top of the league, so your programming can simply continue to be built around it, with good image pre-processing.

    If necessary (after manual corrections), the post-processing step of removing the images can be as simple as a quick pass through Ghostscript, keeping the text and discarding all background images.

    Note that if the pages contain other embedded partial images, Ghostscript would remove those too! In that case a different approach would be needed: the clipped image areas would have to be put back in a final stage.
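
As a rough sketch of that final Ghostscript pass, the Python wrapper below builds and runs a command that rewrites the PDF with the pdfwrite device and the -dFILTERIMAGE switch, which drops raster images and keeps text and vector content. The function names and file names are made up for illustration, and the sketch assumes a reasonably recent Ghostscript whose executable is called "gs" (on Windows it is usually "gswin64c"). It also assumes the text layer has already been made visible during the manual correction stage. If the OCR text is still hidden behind the scan, as it often is in searchable PDFs, the pages will probably come out looking blank.

```python
import shutil
import subprocess
from pathlib import Path


def build_strip_images_command(src, dst, gs_binary="gs"):
    """Build the Ghostscript argument list (hypothetical helper, not part of any library)."""
    return [
        gs_binary,
        "-o", str(dst),       # output file; -o also implies -dBATCH and -dNOPAUSE
        "-sDEVICE=pdfwrite",  # write the result out as a new PDF
        "-dFILTERIMAGE",      # ignore raster images; text and vectors are kept
        str(src),
    ]


def strip_images(src, dst, gs_binary="gs"):
    """Run Ghostscript to produce a copy of src without raster images."""
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    # check=True raises CalledProcessError if Ghostscript exits with an error
    subprocess.run(build_strip_images_command(src, dst, gs_binary), check=True)
    return Path(dst)


if __name__ == "__main__":
    # Example file names are placeholders
    print(" ".join(build_strip_images_command("scanned_ocr.pdf", "text_only.pdf")))
```

Because -dFILTERIMAGE removes every raster image, any part images you want to keep would be lost as well, which is the limitation noted above.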