Search code examples
phptesseractghostscriptpdftk

detect pdf pages that are upside down


We are using php, pypdfocr, and pdftotext to OCR and extract text from documents that have been scanned in or faxed to us. The issue is when the document is scanned or faxed upside down or if some pages are intended to be read landscape (so the text is rotated 90 degrees on the page)

Things I have tried:

  • in tessdata cp eng.traineddata osd.traineddata

The resulting OCR text layer for pages which have 90 degree text isn't bad, however pages that are upside down, it OCR's each word and flips it in place so that if 'This is a test' appears in the document but upside down then the text layer may read 'test a is This'

If there is a way to detect that a page is upside down I can use pdftk to rotate the pages before I run it through the OCR (or i can remove the text layer if it was OCR'd and run it though the OCR again after using pdftk to rotate)

Any solution that can be executed from a linux CLI at this point is a viable solution.


Solution

  • You can get info about page orientation with tesseract (>=3.03 ?) easily. E.g.

    $ tesseract image.png -  -psm 0
    

    will produce this output

    Orientation: 3
    Orientation in degrees: 90
    Orientation confidence: 25.40
    Script: 1 
    Script confidence: 18.40
    

    Based on this information you can adjust image rotation. Example how to do it in python can be e.g. at script Fix image rotation with tesseract.