
Are PDF dimensions meaningful when measured in pixels?


I've been testing two ways of measuring the dimensions of PDFs in pixels using two Python modules - Wand (ImageMagick bindings) and GDAL.

Getting the dimensions of this PDF returns completely different results with each module:

  • Wand reports 556x748
  • GDAL reports 2317x3117.

Is one of these 'wrong'? If I understand correctly PDF dimensions in pixels are device dependent - however, the two results use the same display device.

Are there other factors that can affect the calculation of PDF size in pixels?


Solution

  • 'Wand' and 'GDAL' are not made to process PDFs

    If you rely on Wand (ImageMagick bindings) to process PDFs, you are not using ImageMagick as you may imagine.

    ImageMagick cannot process PDFs by itself: it only processes raster images.

    For other formats, IM has to rely on 'delegates'. ImageMagick delegates are external, third-party utilities that ImageMagick runs to convert 'foreign' file formats into raster images first; those rasters are then passed to ImageMagick for further work. For PDF, that delegate is Ghostscript.

    So even if you only want to determine the dimensions of PDF pages with ImageMagick, the process is not as straightforward as one would like. ImageMagick has to:

    1. Call Ghostscript to render the PDF pages into a raster image. (Do you know which resolution Ghostscript will use to create the rasters?!?)

    2. Run some ImageMagick command to return the dimensions of the GS-created raster image(s) in 'pixels'.

    This can take a very looooong time to return the results -- and the results are dependent on the resolution chosen when rasterizing the PDF pages.

    It's the wrong tool for the job...

    (The same as above is basically true for GDAL, even though it doesn't use Ghostscript for the rasterization. But do you know which default resolution GDAL uses when it converts the vector PDF pages to raster?!?)

    Use the right tool for the job

    PDFs store the dimensions of each page in that page's 'dictionary' under the key /MediaBox. This entry MUST be available for every page of a valid PDF file (it may be inherited from a parent node in the page tree).

    Be aware that PDFs also know the (optional) concepts of /CropBox, /ArtBox, /TrimBox and /BleedBox. The /CropBox value, if present, tells the PDF viewer to hide parts of the complete page and show only a smaller viewport of it (when printing or viewing).
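
    To illustrate that the page size is just a dictionary entry, here is a deliberately naive sketch that scans raw bytes for literal /MediaBox arrays. The names `MEDIABOX_RE` and `find_mediaboxes` are made up for this example, and the sample bytes are hand-written rather than taken from a real file. It assumes the boxes are stored uncompressed and as literal numbers, so it will miss boxes inside compressed object streams or given as indirect references. For real work, use a proper PDF library or pdfinfo.

```python
import re

# Matches "/MediaBox [ llx lly urx ury ]" with literal numbers only.
# Assumption: the entry is stored uncompressed and not as an indirect reference.
MEDIABOX_RE = re.compile(
    rb"/MediaBox\s*\[\s*"
    rb"(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s*\]"
)


def find_mediaboxes(pdf_bytes):
    """Return (width, height) in points for every literal /MediaBox found."""
    sizes = []
    for match in MEDIABOX_RE.finditer(pdf_bytes):
        llx, lly, urx, ury = (float(value) for value in match.groups())
        sizes.append((urx - llx, ury - lly))
    return sizes


# Hand-written fragment of a page dictionary (hypothetical, for illustration).
sample = b"<< /Type /Page /MediaBox [0 0 595 842] /CropBox [87.25 430.36 506.79 728] >>"
print(find_mediaboxes(sample))  # [(595.0, 842.0)]
```

    Note that no rendering is involved: the size comes straight from four numbers in the file.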

    One command line tool to determine the PDF page dimensions is pdfinfo. This utility is based on the Poppler library -- so if you do not want to run an external command, bind your own application to this lib.

    pdfinfo is much faster:

    1. It does not need to render or rasterize or fully interpret the PDF file.

    2. It simply does a (very fast) lookup of the dictionary entries for the dimensions.

    3. These dimensions are returned in points. This unit originates from the PostScript world: 72 points are equivalent to 1 inch. So at a resolution of 72 DPI/PPI, the value in points is also the "dimension in pixels".
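
    The points-to-pixels relationship can be checked against the two results from the question. The sketch below is plain arithmetic; `points_to_pixels` is a name chosen for this example. The numbers are consistent with Wand having rasterized at 72 DPI and GDAL at 300 DPI, but that is an inference from the figures, not something either tool reported.

```python
POINTS_PER_INCH = 72  # PostScript/PDF convention: 72 points = 1 inch


def points_to_pixels(points, dpi):
    """Convert a length in points to pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH


# Sizes reported in the question, in pixels.
wand_px = (556, 748)
gdal_px = (2317, 3117)

# Scale factor between the two results, per axis.
ratios = [g / w for g, w in zip(gdal_px, wand_px)]
print([round(r, 3) for r in ratios])    # [4.167, 4.167]

# If Wand's result corresponds to 72 DPI, the DPI implied for GDAL is about:
print([round(r * 72) for r in ratios])  # [300, 300]

# Treating 556 x 748 as points and rasterizing at 300 DPI gives GDAL's numbers.
print(round(points_to_pixels(556, 300)), round(points_to_pixels(748, 300)))  # 2317 3117
```

    In other words, neither result has to be 'wrong': both can describe the same page at different resolutions.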

    Example (using linked PDF from OP)

    I've run a pdfinfo command against your linked example PDF to determine the dimensions of pages 116-117 (using -f for the first and -l for the last page of the range, and -box to list all page boxes). The command completed in a fraction of a second. Here are the results:

    pdfinfo -f 116 -l 117 -box soils-of-manawatu-county-soil-survey-report-30.pdf
    
     Title:          
     Subject:        
     Keywords:       
     Author:         
     Creator:        ABBYY FineReader
     Producer:       
     CreationDate:   Tue Dec 18 19:11:50 2007
     ModDate:        Tue Dec 18 19:11:50 2007
     Tagged:         no
     UserProperties: no
     Suspects:       no
     Form:           none
     JavaScript:     no
     Pages:          117
     Encrypted:      no
     Page  116 size: 957 x 751 pts
     Page  116 rot:  0
     Page  117 size: 2065 x 2249 pts
     Page  117 rot:  0
     Page  116 MediaBox:     0.00     0.00   957.00   751.00
     Page  116 CropBox:      0.00     0.00   957.00   751.00
     Page  116 BleedBox:     0.00     0.00   957.00   751.00
     Page  116 TrimBox:      0.00     0.00   957.00   751.00
     Page  116 ArtBox:       0.00     0.00   957.00   751.00
     Page  117 MediaBox:     0.00     0.00  2065.00  2249.00
     Page  117 CropBox:      0.00     0.00  2065.00  2249.00
     Page  117 BleedBox:     0.00     0.00  2065.00  2249.00
     Page  117 TrimBox:      0.00     0.00  2065.00  2249.00
     Page  117 ArtBox:       0.00     0.00  2065.00  2249.00
     File size:      2105582 bytes
     Optimized:      yes
     PDF version:    1.2
    

    As you can see, your PDF does not even have identical page dimensions for each of its 117 pages!
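
    If you want these numbers inside a Python program, one option is to call pdfinfo and parse its text output. This is a minimal sketch: it assumes pdfinfo is on your PATH and prints "Page N size: W x H pts" lines as shown above, and the helper names `parse_page_sizes` and `page_sizes` are invented for this example.

```python
import re
import subprocess

# Matches lines such as "Page  116 size: 957 x 751 pts".
# The end of the line is not anchored, in case a paper name follows "pts".
PAGE_SIZE_RE = re.compile(
    r"^\s*Page\s+(\d+)\s+size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts",
    re.MULTILINE,
)


def parse_page_sizes(pdfinfo_output):
    """Map page number -> (width_pts, height_pts) from pdfinfo text output."""
    return {
        int(page): (float(width), float(height))
        for page, width, height in PAGE_SIZE_RE.findall(pdfinfo_output)
    }


def page_sizes(pdf_path, first, last):
    """Run pdfinfo for a page range and return the parsed sizes."""
    result = subprocess.run(
        ["pdfinfo", "-f", str(first), "-l", str(last), pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_page_sizes(result.stdout)


# Excerpt of the output shown above, used here instead of a real pdfinfo run.
sample = """\
Pages:          117
Page  116 size: 957 x 751 pts
Page  116 rot:  0
Page  117 size: 2065 x 2249 pts
Page  117 rot:  0
"""
print(parse_page_sizes(sample))  # {116: (957.0, 751.0), 117: (2065.0, 2249.0)}
```

    With the real file, `page_sizes("soils-of-manawatu-county-soil-survey-report-30.pdf", 116, 117)` should give the same dictionary, provided your pdfinfo prints the same line format.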

    Now let's try the same with an ImageMagick command: ([1])

    identify              \
      -format "%W x %H\n" \
       soils-of-manawatu-county-soil-survey-report-30.pdf[115-116]
    

    ([1] Note: ImageMagick's page numbering is zero-based (the first page has number 0), hence the [115-116] range for pages 116-117.)

    This takes 6 seconds to complete, and returns:

    957 x 751
    2065 x 2249
    

    I've been lucky here, because Ghostscript seems to have been run with a parameter for resolution that equals -r72x72.

    I've seen cases where ImageMagick was set up to use -r75x75 -- which would of course return different values!
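
    To see how much the resolution setting matters, here is the arithmetic for the two pages above at 72 and 75 DPI. The function name `raster_size` is made up for this sketch, and it simply rounds to the nearest pixel; the rounding Ghostscript actually applies may differ by a pixel.

```python
def raster_size(width_pts, height_pts, dpi):
    """Approximate pixel size of a page rasterized at `dpi` (72 pt = 1 inch)."""
    return round(width_pts * dpi / 72), round(height_pts * dpi / 72)


# Page sizes in points, as reported by pdfinfo above.
pages = {116: (957, 751), 117: (2065, 2249)}

for page, (width, height) in pages.items():
    print(page, raster_size(width, height, 72), raster_size(width, height, 75))
# 116 (957, 751) (997, 782)
# 117 (2065, 2249) (2151, 2343)
```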

    Example using another PDF

    The next examples are done with a PDF that represents the User Manual for an IXUS 850 IS camera, as found on the web. I'll retrieve info for the first 3 pages only:

    pdfinfo -box -l 3 _IXUS_850IS_ADVCUG_EN.pdf
    
     Creator:        FrameMaker 6.0
     Producer:       Acrobat Distiller 5.0.5 (Windows)
     CreationDate:   Thu Aug 17 16:43:06 2006
     ModDate:        Tue Aug 22 12:20:24 2006
     Tagged:         no
     UserProperties: no
     Suspects:       no
     Form:           AcroForm
     JavaScript:     no
     Pages:          146
     Encrypted:      no
     Page    1 size: 419.535 x 297.644 pts
     Page    1 rot:  90
     Page    2 size: 297.646 x 419.524 pts
     Page    2 rot:  0
     Page    3 size: 297.646 x 419.524 pts
     Page    3 rot:  0
     Page    1 MediaBox:     0.00     0.00   595.00   842.00
     Page    1 CropBox:     87.25   430.36   506.79   728.00
     Page    1 BleedBox:    87.25   430.36   506.79   728.00
     Page    1 TrimBox:     87.25   430.36   506.79   728.00
     Page    1 ArtBox:      87.25   430.36   506.79   728.00
     Page    2 MediaBox:     0.00     0.00   595.00   842.00
     Page    2 CropBox:    148.17   210.76   445.81   630.28
     Page    2 BleedBox:   148.17   210.76   445.81   630.28
     Page    2 TrimBox:    148.17   210.76   445.81   630.28
     Page    2 ArtBox:     148.17   210.76   445.81   630.28
     Page    3 MediaBox:     0.00     0.00   595.00   842.00
     Page    3 CropBox:    148.17   210.76   445.81   630.28
     Page    3 BleedBox:   148.17   210.76   445.81   630.28
     Page    3 TrimBox:    148.17   210.76   445.81   630.28
     Page    3 ArtBox:     148.17   210.76   445.81   630.28
     File size:      6888764 bytes
     Optimized:      yes
     PDF version:    1.4
    

    As one can see from the output, all three pages have the same /MediaBox of 595 x 842 pts (== A4), but the different /CropBox entries restrict the visible parts of the pages to viewports of these sizes:

    1. Page 1: 419.535 x 297.644 pts
    2. Page 2: 297.646 x 419.524 pts
    3. Page 3: 297.646 x 419.524 pts

    On top of that, the first page is rotated by 90 degrees (as can be seen from the line saying Page 1 rot: 90).

    Now let's compare what my ImageMagick command ([2]) returns:

    identify -format "%W x %H\n" _IXUS_850IS_ADVCUG_EN.pdf[0-2]
    
     842 x 595
     595 x 842
     595 x 842
    

    ([2] Note: The IM on my system is version 6.9.0-0 Q16, which uses Ghostscript v9.10 as a delegate. If you test the same thing on a different system with other IM/GS versions, your output may be different!)
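
    The two sets of numbers can be reproduced from the box values pdfinfo printed. The sketch below (with made-up helper names `box_size` and `rotated`) computes the /CropBox size, which matches pdfinfo's "size" lines, and the /MediaBox size with the page rotation applied, which matches what identify returned. That the rasterizing chain ignored the /CropBox here is my reading of these numbers for this particular IM/GS setup, not documented behaviour.

```python
def box_size(box):
    """Width and height of a PDF box given as (llx, lly, urx, ury) in points."""
    llx, lly, urx, ury = box
    return urx - llx, ury - lly


def rotated(size, rotation):
    """Swap width and height for a 90 or 270 degree page rotation."""
    width, height = size
    return (height, width) if rotation % 180 == 90 else (width, height)


# Box values as printed by pdfinfo above (two decimals), plus the rotation.
pages = {
    1: {"media": (0, 0, 595, 842), "crop": (87.25, 430.36, 506.79, 728.00), "rot": 90},
    2: {"media": (0, 0, 595, 842), "crop": (148.17, 210.76, 445.81, 630.28), "rot": 0},
}

for number, page in pages.items():
    crop_w, crop_h = box_size(page["crop"])
    print(number,
          "crop:", (round(crop_w, 2), round(crop_h, 2)),
          "media+rot:", rotated(box_size(page["media"]), page["rot"]))
# 1 crop: (419.54, 297.64) media+rot: (842, 595)
# 2 crop: (297.64, 419.52) media+rot: (595, 842)
```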

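    So this last example may answer the "Are there other factors that can affect the calculation of PDF size in pixels?" part of the OP question: besides the rasterization resolution, the result depends on which page box the tool measures (here identify returned the /MediaBox size, not the /CropBox size that pdfinfo reports as page size) and on whether the page rotation is applied.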