I've been testing two ways of measuring the dimensions of PDFs in pixels using two Python modules - Wand (ImageMagick bindings) and GDAL.
Getting the dimensions of this PDF returns completely different results with each module:
Is one of these 'wrong'? If I understand correctly PDF dimensions in pixels are device dependent - however, the two results use the same display device.
Are there other factors that can affect the calculation of PDF size in pixels?
If you rely on Wand (ImageMagick bindings) to process PDFs, you are not using ImageMagick the way you may imagine, because ImageMagick cannot process PDFs by itself -- it only processes raster images.
For other formats IM has to rely on 'delegates'. ImageMagick delegates are external, third-party utilities which ImageMagick runs to convert 'foreign' file formats into raster images first -- these then get passed to ImageMagick to do the further work.
So even if you only want to determine the dimensions of PDF pages by using ImageMagick, the process is not as straightforward as one would like:
Call Ghostscript to render the PDF pages into a raster image. (Do you know which resolution Ghostscript will use to create the rasters?!?)
Run some ImageMagick command to return the dimensions of the GS-created raster image(s) in 'pixels'.
This can take a very looooong time to return the results -- and the results are dependent on the resolution chosen when rasterizing the PDF pages.
It's the wrong tool for the job...
(The same as above is basically true for GDAL, even though it doesn't use Ghostscript for the rasterization. But do you know which default resolution GDAL uses when it converts the vector PDF pages to raster?!?)
PDFs store the dimensions for all pages in a 'dictionary' with the key /MediaBox. This key MUST be present in all valid PDF files.
Be aware that PDFs also know the (optional) concepts of /CropBox, /ArtBox, /TrimBox and /BleedBox. The /CropBox key value, if present, may order the PDF viewer to hide parts of the complete page and show only a smaller viewport box of it (when printing or viewing).
One command line tool to determine the PDF page dimensions is pdfinfo. This utility is based on the Poppler library -- so if you do not want to run an external command, bind your own application to this lib.
pdfinfo is much faster:
It does not need to render or rasterize or fully interpret the PDF file.
It simply does a (very fast) lookup of the dictionary entries for the dimensions.
These dimensions are returned in points. This unit originates from the PostScript world: 72 points are equivalent to 1 inch. So at a resolution of 72 DPI/PPI the same numbers are also the "dimensions in pixels"; at any other resolution they have to be scaled by resolution/72.
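The points-to-pixels relationship can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name points_to_pixels is made up for this example, and real renderers round to whole pixels in their own way, so their results may differ by a pixel.

```python
def points_to_pixels(points, dpi=72.0):
    """Convert a length in PDF points (1/72 inch) to pixels at the given DPI."""
    return points * dpi / 72.0


# Page 117 of the example PDF measures 2065 x 2249 pts.
for dpi in (72, 75, 150):
    width_px = points_to_pixels(2065, dpi)
    height_px = points_to_pixels(2249, dpi)
    print(f"{dpi} DPI: {width_px:.1f} x {height_px:.1f} px")
```

Only at 72 DPI do the pixel numbers coincide with the point numbers; every other resolution gives a different "size in pixels" for the very same page.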
I've run a pdfinfo command against your linked example PDF to determine the dimensions of the page range 116-117 (using -f for the first and -l for the last page of the range). The command completed in fractions of a second:
Here are the results:
pdfinfo -f 116 -l 117 -box soils-of-manawatu-county-soil-survey-report-30.pdf
Title:
Subject:
Keywords:
Author:
Creator: ABBYY FineReader
Producer:
CreationDate: Tue Dec 18 19:11:50 2007
ModDate: Tue Dec 18 19:11:50 2007
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 117
Encrypted: no
Page 116 size: 957 x 751 pts
Page 116 rot: 0
Page 117 size: 2065 x 2249 pts
Page 117 rot: 0
Page 116 MediaBox: 0.00 0.00 957.00 751.00
Page 116 CropBox: 0.00 0.00 957.00 751.00
Page 116 BleedBox: 0.00 0.00 957.00 751.00
Page 116 TrimBox: 0.00 0.00 957.00 751.00
Page 116 ArtBox: 0.00 0.00 957.00 751.00
Page 117 MediaBox: 0.00 0.00 2065.00 2249.00
Page 117 CropBox: 0.00 0.00 2065.00 2249.00
Page 117 BleedBox: 0.00 0.00 2065.00 2249.00
Page 117 TrimBox: 0.00 0.00 2065.00 2249.00
Page 117 ArtBox: 0.00 0.00 2065.00 2249.00
File size: 2105582 bytes
Optimized: yes
PDF version: 1.2
As you can see, your PDF does not even have identical page dimensions for each of its 117 pages!
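If you would rather get these numbers from Python than read them off the terminal, one possible approach is to run pdfinfo as a subprocess and parse its "Page N size:" lines. The following is a rough sketch, not a polished wrapper: the helper names parse_page_sizes and page_sizes are invented for this example, it assumes pdfinfo is on your PATH, and it assumes the output format matches what is shown above.

```python
import re
import subprocess

# Matches lines such as "Page  116 size: 957 x 751 pts" in pdfinfo output.
PAGE_SIZE_RE = re.compile(
    r"^Page\s+(\d+)\s+size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts", re.MULTILINE
)


def parse_page_sizes(pdfinfo_output):
    """Return {page_number: (width_pts, height_pts)} from pdfinfo text output."""
    return {
        int(page): (float(width), float(height))
        for page, width, height in PAGE_SIZE_RE.findall(pdfinfo_output)
    }


def page_sizes(pdf_path, first, last):
    """Run pdfinfo for a page range (assumes the pdfinfo binary is installed)."""
    result = subprocess.run(
        ["pdfinfo", "-f", str(first), "-l", str(last), pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_page_sizes(result.stdout)


# Parsing a shortened copy of the output shown above:
sample = """Pages: 117
Page  116 size: 957 x 751 pts
Page  116 rot:  0
Page  117 size: 2065 x 2249 pts
Page  117 rot:  0
"""
print(parse_page_sizes(sample))
```

Against the real file you would call page_sizes("soils-of-manawatu-county-soil-survey-report-30.pdf", 116, 117).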
Now let's try the same with an ImageMagick command: ([1])
identify \
-format "%W x %H\n" \
soils-of-manawatu-county-soil-survey-report-30.pdf[115-116]
([1] Note: ImageMagick's page numbering method is zero-based {first page has number '0'} -- hence the [115-116] range for pages 116-117.)
This takes 6 seconds to complete, and returns:
957 x 751
2065 x 2249
I've been lucky here, because Ghostscript seems to have been run with a parameter for resolution that equals -r72x72.
I've seen cases where ImageMagick was set up to use -r75x75 -- which would of course return different values!
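If you are unsure which resolution a rasterizing tool actually used, you can back it out from one pixel dimension and the matching dimension in points. A minimal sketch follows; the function name effective_dpi is made up here, and the 620 px value is a hypothetical measurement rather than output from a real run.

```python
def effective_dpi(pixels, points):
    """Estimate the rasterization resolution from a pixel size and a size in points."""
    return pixels * 72.0 / points


# Page 116: 957 px reported by identify, 957 pts reported by pdfinfo.
print(effective_dpi(957, 957))

# Hypothetical case: an A4 page (595 pts wide) that came back 620 px wide.
print(round(effective_dpi(620, 595), 1))
```

Because renderers round to whole pixels, the estimate is only approximate for resolutions other than 72 DPI.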
The next examples are done with a PDF that represents the User Manual for an IXUS 850 IS camera, as found on the web. I'll retrieve info for the first 3 pages only:
pdfinfo -box -l 3 _IXUS_850IS_ADVCUG_EN.pdf
Creator: FrameMaker 6.0
Producer: Acrobat Distiller 5.0.5 (Windows)
CreationDate: Thu Aug 17 16:43:06 2006
ModDate: Tue Aug 22 12:20:24 2006
Tagged: no
UserProperties: no
Suspects: no
Form: AcroForm
JavaScript: no
Pages: 146
Encrypted: no
Page 1 size: 419.535 x 297.644 pts
Page 1 rot: 90
Page 2 size: 297.646 x 419.524 pts
Page 2 rot: 0
Page 3 size: 297.646 x 419.524 pts
Page 3 rot: 0
Page 1 MediaBox: 0.00 0.00 595.00 842.00
Page 1 CropBox: 87.25 430.36 506.79 728.00
Page 1 BleedBox: 87.25 430.36 506.79 728.00
Page 1 TrimBox: 87.25 430.36 506.79 728.00
Page 1 ArtBox: 87.25 430.36 506.79 728.00
Page 2 MediaBox: 0.00 0.00 595.00 842.00
Page 2 CropBox: 148.17 210.76 445.81 630.28
Page 2 BleedBox: 148.17 210.76 445.81 630.28
Page 2 TrimBox: 148.17 210.76 445.81 630.28
Page 2 ArtBox: 148.17 210.76 445.81 630.28
Page 3 MediaBox: 0.00 0.00 595.00 842.00
Page 3 CropBox: 148.17 210.76 445.81 630.28
Page 3 BleedBox: 148.17 210.76 445.81 630.28
Page 3 TrimBox: 148.17 210.76 445.81 630.28
Page 3 ArtBox: 148.17 210.76 445.81 630.28
File size: 6888764 bytes
Optimized: yes
PDF version: 1.4
As one can see from the output, all three page sizes ("/MediaBox") are 595 x 842 pts (==A4), but the different /CropBox entries restrict the visible parts of the pages to viewports of these sizes:
419.535 x 297.644 pts
297.646 x 419.524 pts
297.646 x 419.524 pts
On top of that, the first page is rotated by 90 degrees (as can be seen from the line saying Page 1 rot: 90).
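The viewport sizes follow directly from the box coordinates: each box is given as lower-left x/y and upper-right x/y, so width and height are simple differences. Here is a small sketch of that arithmetic; box_size is a name invented for this example, and swapping width and height for a 90 or 270 degree rotation is my assumption about how a viewer or renderer presents such a page.

```python
def box_size(box, rotation=0):
    """Return (width, height) in points for a PDF box (llx, lly, urx, ury).

    Width and height are swapped when the page rotation is 90 or 270 degrees.
    """
    llx, lly, urx, ury = box
    width, height = abs(urx - llx), abs(ury - lly)
    if rotation % 180 == 90:
        width, height = height, width
    return width, height


# Page 1 CropBox from the pdfinfo output above, ignoring rotation:
crop_w, crop_h = box_size((87.25, 430.36, 506.79, 728.00))
print(f"{crop_w:.2f} x {crop_h:.2f} pts")

# Page 1 MediaBox with the 90 degree rotation applied:
print(box_size((0, 0, 595, 842), rotation=90))
```

The small differences from the sizes pdfinfo prints (419.535 x 297.644) come from the box coordinates being shown rounded to two decimals.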
Now let's compare what my ImageMagick command ([2]) returns:
identify -format "%W x %H\n" _IXUS_850IS_ADVCUG_EN.pdf[0-2]
842 x 595
595 x 842
595 x 842
([2] Note: The IM on my system is a 6.9.0-0 Q16 version, which uses Ghostscript v9.10 as a delegate. If you test the same thing on a different system with other IM/GS versions, your output may be different!)
So this last example may answer the "Are there other factors that can affect the calculation of PDF size in pixels?" part of the OP question.