pdfposter: crop / tile / posterize long PDF to multi pages from Safari Export as PDF


When I save a webpage with Safari's File > Export as PDF...,
Safari renders it as one long PDF split over several (very long) pages.

Here is a screenshot of Preview's Crop Inspector.
The 200 inch height appears to be a distiller's limit for PostScript, based on a Windows printer driver limitation.

[Screenshot: Preview's Crop Inspector]

Before saving, I set Safari > Develop > Show Responsive Design Mode to my iPad mini, with a resolution of 768 x 1024 (portrait).
The beauty of Export as PDF (unlike File > Print) is that it can be used with Safari in Responsive Design Mode, so an exact snapshot of the webpage (responsive layout, images and even dark mode) gets exported to PDF, without any print margins and such.

--> Now I want to cut / tile / crop / posterize / de-impose (or whatever one should call it) these long pages [200 inch, or 14400 pt] into more manageable page sizes. With Responsive Design Mode set to iPad mini (768 x 1024), I would like to cut to the same dimensions: a MediaBox / CropBox of 768pt x 1024pt.
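
The target geometry can be worked out before touching any tool. The sketch below is plain Python arithmetic, not part of any PDF library; `tile_boxes` is a hypothetical helper name, and it assumes the page's lower-left corner sits at (0, 0) with the y-axis pointing up, as in default PDF user space. It lists the boxes for cutting one page from the top down:

```python
def tile_boxes(page_w, page_h, tile_h):
    """Return (left, bottom, right, top) boxes covering a page from the top down.

    Hypothetical helper: assumes the page origin is (0, 0) and y grows upward.
    The last box may be shorter than tile_h.
    """
    boxes = []
    top = page_h
    while top > 0:
        bottom = max(top - tile_h, 0)   # never go below the page bottom
        boxes.append((0, bottom, page_w, top))
        top = bottom
    return boxes

boxes = tile_boxes(768, 14400, 1024)    # one 200 inch page, iPad mini sized tiles
print(len(boxes))                       # 15 tiles: 14 full ones plus a remainder
print(boxes[0])                         # (0, 13376, 768, 14400), the first full tile
print(boxes[-1])                        # (0, 0, 768, 64), the 64 pt remainder
```

So a single 14400 pt page should come out as 14 full pages of 1024 pt plus one short page of 64 pt at the end.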

I have already tried various command line tools like BRISS, PDFTILECUT, PLAKATIV, MUPDF etc.

Some libraries, like the Python binding PYMUPDF, seem to convert the PDF to an image first in order to cut it, thus losing all the hyperlinks = NO go

So far I get a decent result with PDFPOSTER using the following command line; I have set the height of the --poster-size box to something really long (1000000pt):

pdfposter \
-v \
-m 768x1024pt \
-p 768x1000000pt \
Safari-Export-as-PDF-IN.pdf \
Safari-Export-as-PDF-OUT.pdf

That works for all the pages, one after the other, but I can't find a way to make the first page start at the top (a Y offset of 0).
The pages always seem to be laid out from the bottom of the poster size, leaving blank space at the top of the first page.

Example PDF: >>> download here <<<

---------          =========
|       |          | xxxxx |
=========          | xxxxx |
| xxxxx |          | xxxxx |
---------          ---------
| xxxxx |          | xxxxx |
| xxxxx |    ->    | xxxxx |
| xxxxx |          | xxxxx |
---------          ---------
| xxxxx |          | xxxxx |
| xxxxx |          =========
| xxxxx |          |       |
=========          ---------
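
One way to read the diagram: PDF coordinates start at the bottom-left corner, so a tiling grid anchored there pushes the leftover space to the top of the first output page. The sketch below only illustrates that arithmetic under this assumption; it is not taken from PDFPOSTER's source, and `top_gap` is a made-up name.

```python
import math

def top_gap(page_h, tile_h):
    """Blank height left above the content when tiles are stacked from the bottom.

    Assumption for illustration: the grid starts at y = 0 and uses whole tiles,
    so the topmost tile overshoots the page by this amount.
    """
    tiles = math.ceil(page_h / tile_h)
    return tiles * tile_h - page_h

print(top_gap(14400, 1024))  # 960: the top tile would hold only 64 pt of content
print(top_gap(2048, 1024))   # 0: no gap when the height divides evenly
```

Anchoring the grid at the top instead (starting at y = page height and stepping down) moves the short tile to the end, which is the layout on the right of the diagram.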

Solution

  • OK, after a lot of testing I found out something: PDFPOSTER does not like PDFs generated from HTML.

    I first made a 100x200px box in Illustrator and exported that to a PDF.

    then ran:

    pdfposter -m 100x80pt -p 100x99999pt in-100x200.pdf out-100x200.pdf
    

    This gives me a very nice result: the first page has a Crop Box of 100x40px and a Media Box of 100x80px, and the rest of the pages have Crop and Media Boxes of 100x80px.

    Then I made a very, very basic HTML file (I even left out the doctype):

    <html>
    <body style="background-color:white;margin:0;padding:0">
    <div style="background-color:gold;width:100%;height:1500px"></div>
    </body>
    </html>
    

    and ran:

    pdfposter -m 767x1024pt -p 767x99999pt cleanHTML-IN.pdf cleanHTML-OUT.pdf
    

    And I get the first page with a white margin at the top, like in my initial problem.
    So is it actually the Crop Box that does not get set when using a PDF generated from HTML?

    UPDATE:

    Thanks to PDFPOSTER I have found my way to PYPDF.
    Basically you define:
    reader = PdfReader('in.pdf')
    writer = PdfWriter()
    I then loop over the pages of the input file (page_x = reader.pages[i]), set the mediabox for each "new" page (like photocopying) and append it to the writer with writer.add_page(page_x). Finally I write everything out with writer.write().

    Regarding corrupt PDF files: PIKEPDF, a Python wrapper around QPDF, features automatic repairs just by opening and saving the file.

    # pikepdf / pikepdf: 
    # https://github.com/pikepdf/pikepdf
    # https://pikepdf.readthedocs.io/en/latest/
    # 
    # py-pdf / pypdf: 
    # https://github.com/py-pdf/pypdf
    # https://pypdf.readthedocs.io/en/latest/
    
    import math
    import os
    
    import pikepdf
    from pypdf import PdfWriter, PdfReader
    
    # define, could become arguments
    pagecut_h  = 1024
    inputfile  = 'in.pdf'
    outputfile = 'out.pdf'
    
    # repair with PikePDF: opening and re-saving lets QPDF fix structural damage
    # (note: this overwrites the input file with the repaired copy)
    print("repairing {0} .....".format(inputfile))
    with pikepdf.Pdf.open(inputfile) as pdf:
        pdf.save(inputfile + '.tmp')
    os.replace(inputfile + '.tmp', inputfile)
    
    reader = PdfReader(inputfile)
    writer = PdfWriter()
    
    pages_n = len(reader.pages)
    print('reading ..... {} input pages'
        .format(pages_n))
    
    for i in range(pages_n):
        
        page   = reader.pages[i]
        page_w = page.mediabox.width
        page_h = page.mediabox.height
        
        print('input page {}/{} [w:{}, h:{}]'
            .format(i + 1, pages_n, page_w, page_h))
        
        if page_h <= pagecut_h:
            print('> input page height is not larger than the cut height')
            print('appending original input page [w:{}, h:{}]'
                .format(page_w, page_h))
            writer.add_page(page)
        else:
            pagesfull_n = math.floor(page_h / pagecut_h)
            # height of the remaining partial page (0 if page_h divides evenly)
            pagelast_h  = page_h - (pagecut_h * pagesfull_n)
            pagesout_n  = pagesfull_n + (1 if pagelast_h > 0 else 0)
            print('calculating .......... {} output pages'
                .format(pagesout_n))
            
            # first FULL page, cut from the top of the input page
            page.mediabox.left   = 0
            page.mediabox.right  = page_w
            page.mediabox.top    = page_h
            page.mediabox.bottom = page_h - pagecut_h
            print('appending output page 1/{} [w:{}, h:{}]'
                .format(pagesout_n, page_w, pagecut_h))
            writer.add_page(page)
            
            # other FULL pages: shift the mediabox down by one cut height each time
            for j in range(pagesfull_n - 1):
                page.mediabox.top    -= pagecut_h
                page.mediabox.bottom -= pagecut_h
                print('appending output page {}/{} [w:{}, h:{}]'
                    .format((j + 2), pagesout_n, page_w, pagecut_h))
                writer.add_page(page)
            
            # LAST (not full) page, skipped when the height divides evenly
            if pagelast_h > 0:
                page.mediabox.top    = pagelast_h
                page.mediabox.bottom = 0
                print('appending last output page {}/{} [w:{}, h:{}]'
                    .format(pagesout_n, pagesout_n, page_w, pagelast_h))
                writer.add_page(page)
        
    with open(outputfile, 'wb') as fp:
        writer.write(fp)
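
To look into the Crop Box question raised above, or to verify what the script wrote, the page boxes can be printed with pypdf. This is only a rough sketch: `box_summary` and `print_boxes` are hypothetical helper names, and the file to inspect is passed on the command line.

```python
import sys

def box_summary(box):
    """Return (left, bottom, width, height) of a rectangle as plain floats.

    Works with pypdf's RectangleObject, which exposes these attributes.
    """
    return (float(box.left), float(box.bottom),
            float(box.width), float(box.height))

def print_boxes(path):
    """Print the Media Box and Crop Box of every page in a PDF file."""
    from pypdf import PdfReader  # imported here so box_summary has no dependency

    reader = PdfReader(path)
    for number, page in enumerate(reader.pages, start=1):
        # check the raw page dictionary first: page.cropbox falls back to the
        # Media Box when the page has no Crop Box entry of its own
        has_crop = '/CropBox' in page
        print(number,
              'media', box_summary(page.mediabox),
              'crop', box_summary(page.cropbox),
              '(explicit Crop Box)' if has_crop else '(no Crop Box entry)')

if __name__ == '__main__':
    # usage (file name is an example): python boxes.py out.pdf
    for pdf_path in sys.argv[1:]:
        print_boxes(pdf_path)
```

Running it on both the input and the output file shows whether a page carries its own Crop Box and whether each output page ended up with the expected height.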