Tags: python, subprocess, python-pdfkit

Cannot write full HTML into the PDF


This has been pissing me off since yesterday, and I'm out of ideas.

I'm trying to write a PDF with a subclass of pdfkit.PDFKit (let's call it MyPDFKit). The subclass itself works fine (I only subclassed it to make it possible to use xvfb-run in the args), so the class is not the problem.

I was trying to convert some HTML to PDF. The template looks like this:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <!-- Simplified for reading. -->
    <style type="text/css">..</style>
  </head>
  <body>
    <!-- Simplified for reading. -->
    {% for obj in objs %}
    <div>
      <div>
        <p>{{ obj.name }}</p>
      </div>
      <p>{{ obj.age }}</p>
    </div>
    {% endfor %}
  </body>
</html>

With this template, and with objs holding nearly 400 instances, the rendered HTML is close to 5k lines.

The problem appears when writing that output into the file. It could be happening in one of these two places:

  1. The stdout of MyPDFKit.to_pdf(..) (called from MyPDFKit.from_string(..)) has a size limit and truncates part of the string (source code of the function is here).
  2. f.write(..) is what truncates the string passed to it.
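
One way to tell these two suspects apart, sketched under assumptions: if the PDF is first obtained as bytes in memory (python-pdfkit returns the bytes when the output path is False), the length of those bytes can be compared with the size of the file after writing. The helper name write_and_compare and the sample payload below are made up for illustration.

```python
import os
import tempfile


def write_and_compare(data, path):
    """Write bytes to path and return (bytes in memory, bytes on disk)."""
    with open(path, "wb") as f:  # binary mode, so nothing gets re-encoded
        f.write(data)
    return len(data), os.path.getsize(path)


if __name__ == "__main__":
    # Stand-in payload; in the real case this would be the bytes returned
    # by something like MyPDFKit(...).to_pdf() or pdfkit.from_string(html, False).
    fake_pdf = b"%PDF-1.4\n" + b"x" * 1_000_000 + b"\n%%EOF\n"
    with tempfile.TemporaryDirectory() as tmp:
        in_memory, on_disk = write_and_compare(fake_pdf, os.path.join(tmp, "out.pdf"))
        print(in_memory, on_disk)
```

If both numbers match, f.write(..) is not dropping anything, and the contents of the bytes themselves are the next thing to inspect.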

It can't be a problem with the template or with the objects' data, because the PDFs are created correctly when I render only a certain range of them (more than about 350 items in the same rendering starts triggering the problem, because of the number of HTML lines). For example, objs[:315] works well, but objs[:350] does not.

I've tried setting the buffer size to -1, which means unlimited, but that doesn't work either. Has anyone had this issue before?


Solution

  • Ok, so finally, with the help of another programmer, I found the issue.

    It looks like PDFKit, when processing a large amount of HTML (in PDF pages, we're talking about more than 349 or so), sends progress-bar comments to the buffer to show how it's going. Then, when it finishes, it also sends a "done" comment message.

    These comments (I call them comments to give them some kind of data type, because I don't really know how PDF files handle comments) cannot be handled by programs like Adobe Reader, which decides that the file is corrupted/damaged, while programs like SumatraPDF/Edge just ignore them and show the PDF nicely.

    Now, how to prevent this behaviour? Pass the --quiet argument. For that, though, you'll need to subclass PDFKit (as I did with MyPDFKit) and add the args manually (line of code).

    Problem solved.

    EDIT

    It seems I can pass --quiet through the options kwarg, so there's no need to subclass if that's the only problem (although it would be nice to have it active by default...)
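
For reference, a minimal sketch of the options route, assuming plain python-pdfkit rather than the MyPDFKit subclass. In python-pdfkit, a flag that takes no value is written as a key with an empty string, so --quiet becomes {"quiet": ""}. The helper names build_quiet_options, looks_clean and render are hypothetical, and looks_clean is only a rough heuristic: it assumes any stray output lands before the %PDF- header or after the %%EOF marker, which may not be where it ended up in this case.

```python
def build_quiet_options(extra=None):
    """Return a pdfkit options dict that always includes the --quiet flag."""
    options = {"quiet": ""}  # empty string = flag without a value
    if extra:
        options.update(extra)
    return options


def looks_clean(pdf_bytes):
    """Rough check: data starts with the PDF header and ends with the EOF marker."""
    return pdf_bytes.startswith(b"%PDF-") and pdf_bytes.rstrip().endswith(b"%%EOF")


def render(html, output_path):
    """Render HTML to a PDF file with progress output suppressed."""
    # Imported here so the helpers above work without pdfkit installed.
    import pdfkit  # third-party; also needs the wkhtmltopdf binary

    return pdfkit.from_string(html, output_path, options=build_quiet_options())
```

Calling render("<p>hello</p>", "out.pdf") should then write the file without the progress messages, and looks_clean can be run on the resulting bytes as a quick sanity check before opening the file in a stricter viewer.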