When I save a webpage with Safari > File > Export as PDF...,
Safari renders the PDF as several (very long) pages.
Here is a screenshot of Preview's Crop Inspector:
The 200 inch height appears to be a distiller’s limit for PostScript, based on the Windows printer driver limitation.
Before saving, I set Safari > Develop > Show Responsive Design Mode
to my iPad mini, with a resolution of 768 x 1024 (portrait).
The beauty of this feature (unlike File > Print) is that it can be used with Safari in Responsive Design Mode, so an exact snapshot of the webpage (responsive layout, images and even dark mode) gets exported to PDF, without any print margins and such.
--> Now I want to cut / tile / crop / posterize / de-impose (or whatever one should call it) these long pages [200 inch or 14400 pt] into more manageable page sizes. With Responsive Design Mode set to iPad mini (768 x 1024), I would like to cut to the same dimensions: a mediabox / cropbox of 768pt x 1024pt.
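To make the numbers concrete, here is a minimal sketch of the arithmetic behind the cut. It assumes the PDF default of 72 pt per inch; the helper name `count_tiles` is only for illustration and is not part of any of the tools mentioned here.

```python
PT_PER_INCH = 72  # default PDF user space unit: 1 pt = 1/72 inch


def count_tiles(page_h_pt, tile_h_pt):
    """Return (full_tiles, remainder_pt) for cutting one long page into tiles."""
    return divmod(page_h_pt, tile_h_pt)


long_page_h = 200 * PT_PER_INCH           # one 200 inch Safari page, in points
full, rest = count_tiles(long_page_h, 1024)
print(long_page_h, full, rest)            # prints: 14400 14 64
```

So every 14400 pt page should turn into 14 full 768x1024 pages plus one short page of 64 pt.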
I have already tried various command line tools like BRISS, PDFTILECUT, PLAKATIV, MUPDF, etc.
Some libraries, like the Python binding PYMUPDF, seem to convert the PDF to an image first in order to cut it, thus losing all the hyperlinks = NO go.
So far I get a decent result with PDFPOSTER using the following command line; I have set the height of the --poster-size
box to something really long (1000000pt):
pdfposter \
-v \
-m 768x1024pt \
-p 768x1000000pt \
Safari-Export-as-PDF-IN.pdf \
Safari-Export-as-PDF-OUT.pdf
That works for all the pages, one after the other, but I can't find a way to set the Y coordinate of the first page to 0.
The pages always seem to start from the bottom of the poster size, leaving space at the top.
Example PDF: >>> download here <<<
---------      =========
|       |      | xxxxx |
=========      | xxxxx |
| xxxxx |      | xxxxx |
---------      ---------
| xxxxx |      | xxxxx |
| xxxxx |  ->  | xxxxx |
| xxxxx |      | xxxxx |
---------      ---------
| xxxxx |      | xxxxx |
| xxxxx |      =========
| xxxxx |      |       |
=========      ---------
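The difference between the two layouts in the diagram can be sketched with plain numbers. This is only my reading of the symptom, not pdfposter's documented behaviour: PDF coordinates start at the bottom-left corner, so a tile grid anchored at y = 0 leaves the leftover at the top, while a grid anchored at the page top leaves it at the bottom. Both function names are made up for this illustration.

```python
def tile_boxes_bottom_up(page_h, tile_h):
    """(bottom, top) ranges with the grid anchored at y = 0, in reading order."""
    boxes = []
    bottom = 0
    while bottom < page_h:
        boxes.append((bottom, bottom + tile_h))  # topmost tile may overshoot page_h
        bottom += tile_h
    return boxes[::-1]  # reading order: topmost tile first


def tile_boxes_top_down(page_h, tile_h):
    """(bottom, top) ranges with the grid anchored at the top of the page."""
    boxes = []
    top = page_h
    while top > 0:
        bottom = max(top - tile_h, 0)  # last tile is shorter instead of overshooting
        boxes.append((bottom, top))
        top = bottom
    return boxes


print(tile_boxes_bottom_up(2500, 1024))
# [(2048, 3072), (1024, 2048), (0, 1024)]  -> first tile has 572 pt of empty space on top
print(tile_boxes_top_down(2500, 1024))
# [(1476, 2500), (452, 1476), (0, 452)]    -> leftover ends up on the last tile
```

The left side of the diagram corresponds to the first list, the right side (what I want) to the second.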
OK, after a lot of testing I found out something: PDFPOSTER does not like PDFs generated from HTML.
I first made a 100x200px box in Illustrator and exported that to a PDF,
then ran:
pdfposter -m 100x80pt -p 100x99999pt in-100x200.pdf out-100x200.pdf
This gives me a very nice result: the first page has a Crop Box of 100x40px and a Media Box of 100x80px, and the rest of the pages have Crop & Media Boxes of 100x80px.
Then I made a very basic HTML file (I even left out the doctype):
<html>
<body style="background-color:white;margin:0;padding:0">
<div style="background-color:gold;width:100%;height:1500px"></div>
</body>
</html>
and ran:
pdfposter -m 767x1024pt -p 767x99999pt cleanHTML-IN.pdf cleanHTML-OUT.pdf
And I get the first page with a white margin at the top, like in my initial problem.
So is this actually the Crop Box, which does not seem to be set in a PDF generated from HTML?
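One way to test that guess is to look at which box entries are literally present in each page dictionary. The sketch below assumes pypdf is installed; `explicit_boxes` and `report` are hypothetical helper names. Note that pypdf's `page.cropbox` falls back to the MediaBox when no CropBox is stored, so the check has to be done on the dictionary keys, and it only looks at the page itself, not at boxes inherited from a parent node.

```python
BOX_KEYS = ("/MediaBox", "/CropBox", "/BleedBox", "/TrimBox", "/ArtBox")


def explicit_boxes(page):
    """Return the box keys that are literally present in a page dictionary."""
    return [key for key in BOX_KEYS if key in page]


def report(path):
    """Print the explicitly set boxes and the MediaBox for every page of `path`."""
    from pypdf import PdfReader  # imported lazily; explicit_boxes() has no dependency

    for number, page in enumerate(PdfReader(path).pages, start=1):
        print(number, explicit_boxes(page), page.mediabox)


# Usage (file names as in the test above):
# report("in-100x200.pdf")
# report("cleanHTML-IN.pdf")
```

If the Illustrator file lists "/CropBox" and the Safari file does not, that would support the idea; if both look the same, the cause is somewhere else.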
UPDATE:
Thanks to PDFPOSTER I have found my way to PYPDF.
Basically you define:
reader = PdfReader('in.pdf')
writer = PdfWriter()
I then loop over the pages from the input file:
page_x = reader.pages[i]
set the mediabox for each "new" page (like photocopying) and append it to the writer:
writer.add_page(page_x)
Finally, write the result out with writer.write().
Regarding corrupt PDF files: PIKEPDF, a Python wrapper around QPDF, features automatic repairs just by opening and saving the file.
# pikepdf / pikepdf:
# https://github.com/pikepdf/pikepdf
# https://pikepdf.readthedocs.io/en/latest/
#
# py-pdf / pypdf:
# https://github.com/py-pdf/pypdf
# https://pypdf.readthedocs.io/en/latest/
import pikepdf, os, math
from pypdf import PdfWriter, PdfReader
# define, could become arguments
pagecut_h = 1024
inputfile = 'in.pdf'
outputfile = 'out.pdf'
# repair with PikePDF
print("repairing {0} .....".format(inputfile))
pdf = pikepdf.Pdf.open(inputfile)
pdf.save(inputfile + '.tmp')
pdf.close()
os.unlink(inputfile)
os.rename(inputfile + '.tmp', inputfile)
reader = PdfReader(inputfile)
writer = PdfWriter()
pages_n = len(reader.pages)
print('reading ..... {} input pages'
.format(pages_n))
for i in range(pages_n):
    page = reader.pages[i]
    page_w = page.mediabox.width
    page_h = page.mediabox.height
    print('input page {}/{} [w:{}, h:{}]'
          .format(i + 1, pages_n, page_w, page_h))

    if page_h <= pagecut_h:
        print('> input page height is smaller than the cut height')
        print('appending original input page [w:{}, h:{}]'
              .format(page_w, page_h))
        writer.add_page(page)
    else:
        pagesfull_n = math.floor(page_h / pagecut_h)
        # height of the last (not full) page; 0 if page_h is an exact multiple
        pagelast_h = page_h - (pagecut_h * pagesfull_n)
        pagesout_n = pagesfull_n + (1 if pagelast_h > 0 else 0)
        print('calculating .......... {} output pages'
              .format(pagesout_n))

        # first FULL page: start cutting at the top of the input page
        page.mediabox.left = 0
        page.mediabox.right = page_w
        page.mediabox.top = page_h
        page.mediabox.bottom = page_h - pagecut_h
        print('appending output page 1/{} [w:{}, h:{}]'
              .format(pagesout_n, page_w, pagecut_h))
        writer.add_page(page)

        # other FULL pages: shift the mediabox down by one cut height each time
        for j in range(pagesfull_n - 1):
            page.mediabox.top -= pagecut_h
            page.mediabox.bottom -= pagecut_h
            print('appending output page {}/{} [w:{}, h:{}]'
                  .format(j + 2, pagesout_n, page_w, pagecut_h))
            writer.add_page(page)

        # LAST (not full) page, skipped when nothing is left over
        if pagelast_h > 0:
            page.mediabox.top = pagelast_h
            page.mediabox.bottom = 0
            print('appending last output page {}/{} [w:{}, h:{}]'
                  .format(pagesout_n, pagesout_n, page_w, pagelast_h))
            writer.add_page(page)

with open(outputfile, 'wb') as fp:
    writer.write(fp)