
Convert html to pdf on the fly using Python and HTMLDOC


I built a Django application for a client about a year ago. He has now resold the application to some super secret government agency that they won't even tell me the name of.

Part of the application dynamically generates PDF files using the Python library xhtml2pdf (pisa). The Government doesn't like this library and won't tell me why; they just said I have to use HTMLDOC for PDF generation.

There's not much documentation on this library, but from reading the PHP example, it looks like you can just communicate with it through the shell, so it should work with Python. However, I'm having a hard time passing the HTML to HTMLDOC. It looks like HTMLDOC will only accept a file, but I really need to pass the HTML as a string since it's dynamically generated (or else write the HTML string to a temporary file and pass that file to HTMLDOC).

I thought StringIO would work for this, but I'm getting an error. Here's what I have:

def render_to_pdf(template_src, context_dict):
    template = get_template(template_src)
    context = Context(context_dict)
    html  = template.render(context)
    result = StringIO.StringIO(html.encode("utf-8"))
    os.putenv("HTMLDOC_NOCGI", "1")

    #this line throws "[Errno 2] No such file or directory"
    htmldoc = subprocess.Popen("htmldoc -t pdf --quiet '%s'" % result, stdout=subprocess.PIPE).communicate()

    pdf = htmldoc[0]
    result.close()
    return HttpResponse(pdf, mimetype='application/pdf')

Any ideas, tips, or help would be really appreciated.

Thanks.

UPDATE

Stack Trace:

Environment:


Request Method: GET
Request URL: (redacted)

Django Version: 1.3 alpha 1 SVN-14921
Python Version: 2.6.5
Installed Applications:
['django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.sites',
 'django.contrib.messages',
 'django.contrib.admin',
 'application']
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware')


Traceback:
File "/usr/local/lib/python2.6/dist-packages/django/core/handlers/base.py" in get_response

  111. response = callback(request, *callback_args, **callback_kwargs)

File "/usr/local/lib/python2.6/dist-packages/django/contrib/auth/decorators.py" in _wrapped_view

  23. return view_func(request, *args, **kwargs)

File "/home/ascgov/application/views/pdf.py" in application_pdf

  90. 'user':owner})

File "/home/ascgov/application/views/pdf.py" in render_to_pdf

  53. htmldoc = subprocess.Popen("/usr/bin/htmldoc -t pdf --quiet '%s'" % result, stdout=subprocess.PIPE).communicate()

File "/usr/lib/python2.6/subprocess.py" in __init__

  633. errread, errwrite)

File "/usr/lib/python2.6/subprocess.py" in _execute_child

  1139. raise child_exception

Exception Type: OSError at /pdf/application/feed-filtr/
Exception Value: [Errno 2] No such file or directory

Solution

  • First, subprocess.Popen's first argument should generally be a list (unless you also pass shell=True). The "No such file or directory" error is almost certainly caused by the absence of a file named "htmldoc -t pdf --quiet '..." on the system: without shell=True, Popen treats the whole string as the name of the program to find and run.

    Second, if you give htmldoc some HTML on its stdin, it will write the PDF to its stdout, which avoids the need for a temporary file.

    Give this a try (untested):

    htmldoc = subprocess.Popen(
      ['/usr/bin/htmldoc', '-t', 'pdf', '--webpage', '-'],
      stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    # communicate() expects a byte string, so encode the rendered template first
    stdout, stderr = htmldoc.communicate(html.encode('utf-8'))

    NB: replace /usr/bin/htmldoc with the real path to htmldoc on your system.

    The - argument tells the htmldoc program to read from stdin. You pass your HTML string value (html) to htmldoc's stdin as the argument to the htmldoc.communicate call. The resulting PDF output should be available in stdout, and any other messages or statistics in stderr.

    Edit: The documentation does seem a bit wonky, but there is quite a bit of it. You might have better luck with the single-page HTML or PDF versions, or the man page.

    Also, be sure to pass a string, or similar, to the stdin of the htmldoc process. Passing a StringIO object directly, as was implied by my previous code snippet, won't work.
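Putting the two points together (argument list instead of a single string, HTML fed through stdin), here is a minimal sketch of how the pieces might fit. The helper names pipe_through and html_to_pdf are hypothetical, the /usr/bin/htmldoc default is only the path used above, and the flags are the ones already shown in this thread. Like the snippet above, it has not been tested against a real htmldoc install.

```python
import os
import subprocess


def pipe_through(argv, data, extra_env=None):
    """Run argv (a list, so no shell is involved), write `data` (bytes)
    to its stdin, and return (stdout, stderr, returncode)."""
    env = dict(os.environ)
    if extra_env:
        env.update(extra_env)
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=env,
    )
    out, err = proc.communicate(data)
    return out, err, proc.returncode


def html_to_pdf(html, htmldoc_path="/usr/bin/htmldoc"):
    """Hypothetical wrapper: convert an HTML text string to PDF bytes.

    Assumes htmldoc reads HTML from stdin when given "-" and writes
    the PDF to stdout, as described in the answer above.
    """
    argv = [htmldoc_path, "-t", "pdf", "--quiet", "--webpage", "-"]
    # HTMLDOC_NOCGI is taken from the question's os.putenv call.
    pdf, err, returncode = pipe_through(
        argv, html.encode("utf-8"), extra_env={"HTMLDOC_NOCGI": "1"}
    )
    # How to interpret returncode and err is left to the caller.
    return pdf
```

In the view from the question, the result of html_to_pdf(template.render(context)) would then go straight into HttpResponse(pdf, mimetype='application/pdf'). Returning the exit status and stderr from pipe_through keeps it possible to log whatever htmldoc reports when the PDF comes back empty.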