
How to extract rotation/transformation information for images extracted from a PDF (i.e. how do viewers know to rotate by 180°?)


I am using a ScanSnap scanner that generates PDF-1.3 files. The orientation of the scanned documents is auto-corrected (rotated by 0 or 180 degrees) when the PDF is viewed in Adobe Reader. OCR is done by the scanning software, and I assume the orientation is determined at that point and encoded into the PDF.

Note that I know I can use Tesseract or other OCR tools to determine whether rotation is needed, but I do not want to, as the scanner software seems to have already determined the orientation and is telling PDF viewers whether rotation is needed.

When I use image extraction tools (such as xpdf's pdfimages or Python libraries), they do not rotate the JPEG images by 180 degrees when needed.

NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.

I have scanned a document twice, once at 0 degrees and once at 180 degrees. I cannot seem to reverse engineer what tells Adobe/Foxit whether to rotate the image when viewing. I have looked at the PDF-1.3 specification and compared the PDF binary data of the orientation-corrected and the uncorrected file, but I cannot determine what is correcting the orientation.

  • No /Page/Rotate (defaults to 0) in PDF
  • No EXIF orientation in JPEG
  • I do not see any transformation matrix (cm operator) in PDF

In both cases the PDF binary looks like the following (truncated where the JPEG stream data begins):

UPDATED: links to PDF files rotated-180 rotated-0

%PDF-1.3
%âãÏÓ
1 0 obj
<</Metadata 20 0 R/Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</MediaBox[0.0 0.0 606.6 794.88]/Count 1/Type/Pages/Kids[4 0 R]>>
endobj
4 0 obj
<</Parent 2 0 R/Contents 18 0 R/PieceInfo<</PSL<</Private<</V(3.2.9)>>/LastModified(D:20190201125524-00'00')>>>>/MediaBox[0.0 0.0 606.6 794.88]/Resources<</XObject<</Im0 5 0 R>>/Font<</C0_0 11 0 R/T1_0 16 0 R>>/ProcSet[/PDF/Text/ImageC]>>/Type/Page/LastModified(D:20190201085524-04'00')>>
endobj
5 0 obj
<</Subtype/Image/Length 433576/Filter/DCTDecode/Name/X/BitsPerComponent 8/ColorSpace/DeviceRGB/Width 1685/Height 2208/Type/XObject>>stream

Does anyone know how PDF viewers know whether to rotate an image by 180 degrees? Is it metadata within the PDF or the JPEG image that can be extracted? Do Adobe and other viewers do something dynamically when opening a document to determine whether orientation correction is needed?

I'm no expert on the PDF specification, but I was hoping someone may have already found a solution to this problem.


Solution

  • The image Im0 in the resources of the page in "internetfile-180.pdf" is not rotated.

    But the image Im0 in the resources of the page in "internetfile.pdf" is rotated.

    In the viewer both look upright, so in "internetfile.pdf" a technique must be used that rotates the image.

    There are two major techniques for this:

    • Setting the Rotate property of the page accordingly, i.e. here to 180.
    • Applying a rotation transformation to the current transformation matrix in the content stream of the page.

    Let's look at the page dictionary first, a bit pretty-printed:

    4 0 obj
    <<
      /Parent 2 0 R
      /Contents 13 0 R
      /PieceInfo
      <<
        /PSL
        <<
          /Private <</V (3.2.9)>>
          /LastModified (D:20190204142537-00'00')
        >>
      >>
      /MediaBox [0.0 0.0 608.64 792.24]
      /Resources
      <<
        /XObject <</Im0 5 0 R>>
        /Font <</T1_0 11 0 R>>
        /ProcSet [/PDF /Text /ImageC]
      >>
      /Type /Page
      /LastModified (D:20190204102537-04'00')
    >> 
    

    As we see, there is no Rotate entry present. Thus, we'll have to look at the page content stream. According to the page dictionary it's in object 13, generation 0.

    That object is a stream object with deflated stream data:

    13 0 obj
    <<
      /Length 4014
      /Filter /FlateDecode
    >>
    stream
    H‰”WÛŽÛF}Ÿ¯Ð[lÀÓÓ÷˾e½
    [...]
    ÿüòÛÿ ´ß
    endstream
    endobj 
    

    After inflating, the stream data start like this:

    q
    -608.3999939 0 0 -792.9600067 608.3999939 792.9600067 cm
    /Im0 Do
    Q
    [...]
    

    This is indeed an application of the second technique: the cm instruction applies the rotation, and the Do instruction paints the image with the rotation active.
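The inflate-and-inspect step can be sketched in Python using only the standard library. This is a minimal sketch, not a full content-stream parser: the helper name `image_matrices` is hypothetical, the sample bytes are a stand-in built from the operators shown above, and pulling the raw stream bytes of object 13 out of the PDF file is assumed to happen elsewhere. It only reports the most recent cm seen before each Do and does not track q/Q nesting or concatenated cm operators.

```python
import re
import zlib

# One number in content-stream syntax, e.g. "-608.3999939", "0" or ".5".
NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"
# Either six numbers followed by cm, or a name followed by Do.
TOKEN_RE = re.compile(r"((?:%s\s+){6})cm\b|/(\S+)\s+Do\b" % NUM)


def image_matrices(content: bytes) -> dict:
    """Map each painted XObject name to the last cm matrix seen before its Do.

    Simplification: q/Q nesting and multiple concatenated cm operators
    are ignored, so this is only reliable for simple scanner output.
    """
    text = content.decode("latin-1")  # latin-1 never fails on arbitrary bytes
    last_cm = None
    painted = {}
    for match in TOKEN_RE.finditer(text):
        if match.group(1) is not None:
            # Six operands a b c d e f of the cm operator.
            last_cm = tuple(float(v) for v in match.group(1).split())
        else:
            painted[match.group(2)] = last_cm
    return painted


# Stand-in for the deflated bytes of the content stream (built here so the
# sketch runs on its own; a real script would read them from the PDF).
deflated = zlib.compress(
    b"q\n-608.3999939 0 0 -792.9600067 608.3999939 792.9600067 cm\n/Im0 Do\nQ\n"
)
inflated = zlib.decompress(deflated)  # FlateDecode is plain zlib/deflate
print(image_matrices(inflated))
```

For anything beyond simple single-image pages, a proper PDF library that tracks the graphics state would be the safer choice.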

    In detail, the cm instruction applies the affine transformation represented by the matrix

    -608.3999939    0            0
       0         -792.9600067    0
     608.3999939  792.9600067    1
    

    In other words:

    x' = -608.3999939 * x + 608.3999939
    y' = -792.9600067 * y + 792.9600067
    

    This transformation actually is a combination of a rotation by 180°, a horizontal scaling by 608.3999939 and a vertical scaling by 792.9600067, and a translation by 608.3999939 horizontally and 792.9600067 vertically.
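To recover the rotation angle programmatically, the six cm operands can be decomposed as follows. This is a sketch under the assumption that the matrix contains only rotation, scaling and translation (no skew, no mirroring); the function name `decompose` is hypothetical.

```python
import math


def decompose(a, b, c, d, e, f):
    """Split a PDF matrix [a b c d e f] into (angle in degrees, sx, sy, (tx, ty)).

    Assumes rotation + scaling + translation only. A non-positive
    determinant indicates mirroring (or a degenerate matrix), which this
    sketch does not handle.
    """
    if a * d - b * c <= 0:
        raise ValueError("matrix is mirrored or degenerate")
    sx = math.hypot(a, b)  # length of the transformed x unit vector
    sy = math.hypot(c, d)  # length of the transformed y unit vector
    angle = math.degrees(math.atan2(b, a)) % 360
    return angle, sx, sy, (e, f)


angle, sx, sy, shift = decompose(
    -608.3999939, 0.0, 0.0, -792.9600067, 608.3999939, 792.9600067
)
print(angle, sx, sy, shift)
```

An angle of 180 for the matrix in front of the image's Do operator means the extracted JPEG has to be turned by 180° to match what the viewer shows.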

    The Do instruction now paints the image. Here one needs to know that this instruction first scales the image to fit into the unit 1×1 square at the origin and then applies the current transformation matrix.

    Thus, the image is drawn rotated by 180°, covering a 608.4×792.96 area that effectively fills the whole 608.64×792.24 MediaBox of the page.
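As a quick check of the reasoning above, the corners of the unit square can be pushed through the formulas for x' and y'. This is a small illustrative sketch; the helper name `apply_matrix` is hypothetical.

```python
def apply_matrix(matrix, x, y):
    """Apply a PDF matrix [a b c d e f] to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


cm = (-608.3999939, 0.0, 0.0, -792.9600067, 608.3999939, 792.9600067)

# Corners of the unit square the image is first scaled into.
corners = {
    "lower left": (0, 0),
    "lower right": (1, 0),
    "upper left": (0, 1),
    "upper right": (1, 1),
}
for name, (x, y) in corners.items():
    print(name, "->", apply_matrix(cm, x, y))
```

The lower-left corner of the image lands at the upper-right corner of the painted area and the upper-right corner lands at the page origin, which is what a 180° rotation does.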