python pdf playwright playwright-python

Playwright page.pdf() only gets one page


I have been trying to convert HTML to PDF. I have tried a lot of tools, but none of them worked. Now I am using Playwright. It converts the page to PDF, but it only captures the first screen of content, and the right-hand side of that page is cut off.

import os
import time
import pathlib
from playwright.sync_api import sync_playwright

filePath = os.path.abspath("Lab6.html")
fileUrl = pathlib.Path(filePath).as_uri()
fileUrl = "file://C:/Users/PMYLS/Desktop/Code/ScribdPDF/Lab6.html"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(fileUrl)
    for i in range(5):  # the scroll is not working
        page.mouse.wheel(0, 15000)
        time.sleep(2)
    page.wait_for_load_state('networkidle') 
    page.emulate_media(media="screen")
    page.pdf(path="sales_report.pdf")
    browser.close()

[Image: HTML view]

[Image: PDF view after running the script]

I have tried almost every tool available on the internet. I also used Selenium, with the same result. I thought the page was not loading properly, so I added waits and manually scrolled the whole page to load the content. Everything gives the same result.

The HTML I am converting: https://drive.google.com/file/d/16jEq52iXtAMCg2FDt3VbQN0dCQmdTip_/view?usp=sharing


Solution

  • Here's a somewhat dirty solution that worked on my end. The sleep-and-scroll loop isn't great and can probably be improved, but I'll leave this as a starter and see if I have time to tighten it up later (feel free to do the same).

    from playwright.sync_api import sync_playwright # 1.37.0
    from time import sleep
    
    
    with open("index.html") as f:
        html = f.read()
    
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html)
    
        # focus inside the annoying border to enable scroll
        page.click(".document_container")
    
        for i in range(10):
            page.mouse.wheel(0, 2500)
            sleep(0.5)
    
        # strip out the annoying border that messes up PDF generation
        page.evaluate("""() => {
            const el = document.querySelector(".document_scroller");
            el.parentElement.appendChild(el.querySelector(".document_container"));
            el.remove();
        }""")
        page.emulate_media(media="screen")
        page.pdf(path="sales_report.pdf")
        browser.close()
    

    Two tricks:

    1. Clicking inside the border area enables scrolling, which appears necessary to get everything to load.
    2. Ripping out the annoying border allows the PDF generation to capture all pages. When the border is present, there's no scroll on the main body, only on the interior container, which the PDF capture doesn't seem to understand.
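
If the fixed number of sleeps bothers you, here's a rough sketch of one way to tighten up the scroll loop: scroll the interior container directly and stop once its scroll position no longer changes. This is untested against the original file. It assumes `.document_scroller` is the element that actually scrolls, and `scroll_until_stable` is a helper name I made up, not part of Playwright. It may also stop early if content loads more slowly than the pause, so treat it as a starting point.

```python
import time


def scroll_until_stable(page, selector, step=2500, pause=0.5, max_rounds=40):
    """Scroll the element matching `selector` until its scrollTop stops changing.

    `page` is expected to be a Playwright sync Page (anything with a
    compatible evaluate(expression, arg) method works). Returns the number
    of scroll steps that actually moved the element.
    """
    last_top = -1
    for rounds in range(max_rounds):
        # Scroll the container itself and read back its new position.
        top = page.evaluate(
            """([sel, dy]) => {
                const el = document.querySelector(sel);
                el.scrollBy(0, dy);
                return el.scrollTop;
            }""",
            [selector, step],
        )
        if top == last_top:
            # Position did not move: assume we reached the bottom.
            return rounds
        last_top = top
        # Give lazily loaded content a moment to appear before the next step.
        time.sleep(pause)
    return max_rounds
```

In the script above, the `for i in range(10)` loop would become `scroll_until_stable(page, ".document_scroller")`, placed before the `page.evaluate` call that removes the scroller. Whether the initial click is still needed with this approach is something I haven't checked.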