I'm running in the browser. I have a File (the JavaScript File type) of type application/pdf. I want to check that the format of the PDF is either US Letter (8.5 in. x 11 in.) or US Legal (8.5 in. x 14 in.), in either landscape or portrait orientation.
I've taken a look at jsPDF, and though it looks great for creating PDF documents programmatically (which will come in handy for tests), I was not able to find a way to use it to parse an existing PDF File and get information about the document (such as the page format and orientation).
Any help in achieving my goal will be appreciated, whether it is with jsPDF, another library, or vanilla JS.
Using simple text parsing, most but not all PDF files will have one or more /MediaBox entries that describe the pages. /CropBox is the size of the viewed page, and is thus potentially the better one to use if present. Page lengths are usually given in points (1/72 inch) unless the page uses a different UserUnit. The format is [x0, y0, x1, y1], so the values may not always start with 0, or even match the values below. It is the difference between x0 and x1 that indicates the nominal width, and likewise the difference between y0 and y1 for the height.
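As a rough illustration of the text search described above, here is a minimal sketch. The names `readPdfText` and `extractBoxes` are made up for this example and are not from any library. It assumes the boxes are stored as plain text in the file, so it will find nothing for entries that sit inside compressed object streams, and it ignores UserUnit.

```javascript
// Matches one PDF number such as 612, 595.28, -10, .5 or 4.
const NUM = String.raw`[-+]?(?:\d+\.?\d*|\.\d+)`;

// /MediaBox or /CropBox followed by an array of four numbers,
// with or without whitespace before the opening bracket.
const BOX_RE = new RegExp(
  String.raw`/(MediaBox|CropBox)\s*\[\s*(${NUM})\s+(${NUM})\s+(${NUM})\s+(${NUM})\s*\]`,
  "g"
);

// Hypothetical helper: read a browser File as text, one character per byte,
// so that binary sections do not break the decoding.
async function readPdfText(file) {
  const bytes = await file.arrayBuffer();
  return new TextDecoder("latin1").decode(bytes);
}

// Hypothetical helper: list every box found, with its nominal size in points.
function extractBoxes(pdfText) {
  const boxes = [];
  for (const m of pdfText.matchAll(BOX_RE)) {
    const [x0, y0, x1, y1] = m.slice(2, 6).map(Number);
    boxes.push({
      kind: m[1],                    // "MediaBox" or "CropBox"
      width: Math.abs(x1 - x0),      // difference between x0 and x1
      height: Math.abs(y1 - y0),     // difference between y0 and y1
    });
  }
  return boxes;
}
```

In the browser this could be used as `extractBoxes(await readPdfText(file))`. An empty result does not mean the PDF has no pages, only that no box was visible to a plain text search.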
Here are just a few of the first entries from recent European examples. Note the variations: some are integers, some are reals, and some mix both (as with much PDF content, there is no enforced rule).
/MediaBox [0 0 595.28 841.89]
/MediaBox [0 0 842 595]
/MediaBox[0 0 387.36 594]
For US Letter and Legal the values are usually integers, so expect or search for
/MediaBox[0 0 612 792] and similar
In many cases all the pages are the same shape, even if they are intended to be rotated later, but sometimes page sizes are mixed. Detecting that requires searching and counting all pages (presuming all are stored as simple textual descriptions).
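To see whether a file mixes page shapes, one option is to tally the distinct sizes found. This is only a sketch: `tallyPageSizes` is a hypothetical name, sizes are rounded to whole points, and a /MediaBox that is declared once and shared by many pages will be counted once, so the counts are entries found rather than a reliable page count.

```javascript
// Hypothetical helper: count how often each rounded "WIDTHxHEIGHT"
// (in points) appears among the plain-text /MediaBox entries.
function tallyPageSizes(pdfText) {
  const re = /\/MediaBox\s*\[\s*([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s*\]/g;
  const counts = new Map();
  for (const m of pdfText.matchAll(re)) {
    const nums = m.slice(1, 5).map(Number);
    if (nums.some(Number.isNaN)) continue; // skip anything that is not four numbers
    const [x0, y0, x1, y1] = nums;
    const key = `${Math.round(Math.abs(x1 - x0))}x${Math.round(Math.abs(y1 - y0))}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

If the returned Map has more than one key, the file contains more than one page shape and each one needs to be checked.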
/MediaBox .... 0 576 720] = A US Gov Letter Portrait Page (8" x 10") [*]
/MediaBox .... 0 720 576] = A US Gov Letter Landscape Page
/MediaBox .... 0 576 756] = A US Gov Letter Portrait Page (8" x 10.5")
/MediaBox .... 0 756 576] = A US Gov Letter Landscape Page
/MediaBox .... 0 576 936] = A US Gov Legal Portrait Page (8" x 13") [*]
/MediaBox .... 0 936 576] = A US Gov Legal Landscape Page
/MediaBox .... 0 612 792] = A US Letter Portrait Page (8.5" x 11")
/MediaBox .... 0 792 612] = A US Letter Landscape Page
/MediaBox .... 0 612 936] = A US Gov Legal Portrait Page (8.5" x 13") [*]
/MediaBox .... 0 936 612] = A US Gov Legal Landscape Page
/MediaBox .... 0 612 1008] = A US Legal Portrait Page (8.5" x 14")
/MediaBox .... 0 1008 612] = A US Legal Landscape Page
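The list above can be turned into a small lookup table. The sketch below is one possible way to do it: `US_SIZES` and `identifyUsSize` are names invented for this example, and the tolerance of 1 point is an assumption to absorb values that are stored as reals rather than integers.

```javascript
// Sizes from the list above, in points, stored short side first.
const US_SIZES = [
  { name: "US Gov Letter (8 x 10 in.)", short: 576, long: 720 },
  { name: "US Gov Letter (8 x 10.5 in.)", short: 576, long: 756 },
  { name: "US Gov Legal (8 x 13 in.)", short: 576, long: 936 },
  { name: "US Letter (8.5 x 11 in.)", short: 612, long: 792 },
  { name: "US Gov Legal (8.5 x 13 in.)", short: 612, long: 936 },
  { name: "US Legal (8.5 x 14 in.)", short: 612, long: 1008 },
];

// Hypothetical helper: name a page from its width and height in points.
// Returns null when no table entry is within `tolerance` points on both sides.
function identifyUsSize(width, height, tolerance = 1) {
  const short = Math.min(width, height);
  const long = Math.max(width, height);
  const hit = US_SIZES.find(
    (s) =>
      Math.abs(s.short - short) <= tolerance &&
      Math.abs(s.long - long) <= tolerance
  );
  if (!hit) return null;
  // Orientation here only reflects the box shape, not any rotation applied later.
  return { name: hit.name, orientation: width > height ? "landscape" : "portrait" };
}
```

To accept only US Letter and US Legal, as in the question, check that the returned name starts with "US Letter" or "US Legal" and reject everything else, including null.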
There are other historic American sizes.
* https://en.wikipedia.org/wiki/Paper_size#Loose_sizes
Rotation is most frequently set to 0, and then a matrix transformation or an action is applied to rotate the first view, in those readers that support such actions or scripting without blocking them. Thus it is not useful to search through 1001 /Rotate 0 entries.
So, for example, I should have added that the first random file I gave as an example above is an upright portrait page, narrow and tall, but it is a diagram meant to be read from the right, as a landscape airport layout. Tests for which way it should be read would fail: either portrait or landscape can potentially be set in the PDF, but it is up to the user to read both texts, first as portrait and then as landscape.