django python-3.x google-cloud-platform google-cloud-storage pdfkit

How to upload an HTML string as a PDF file to Google Cloud Storage? (Python)

I am trying to upload an HTML string, as a PDF file, to GCS, from my Django application.

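import logging

from google.cloud import storage

# The methods below log through `logger`, so it has to be defined
logger = logging.getLogger(__name__)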


class GcpHelper:
    def __init__(self, bucket_name):
        self.service_account_json_path = 'google_auth.json'
        storage_client = storage.Client.from_service_account_json(self.service_account_json_path)
        try:
            self.__bucket_name = bucket_name
            self.bucket = storage_client.get_bucket(bucket_name)
        except Exception as err:
            logger.error("Error {} while connecting to bucket {}!".format(str(err), bucket_name))


    def put_data_in_bucket(self, file_name, data, content_type="application/pdf"):
        """Uploads data to gcp bucket"""
        try:
            blob = self.bucket.blob(file_name)
            blob.upload_from_string(data, content_type=content_type)
            return True
        except Exception as err:
            logger.error("Unable to upload file {} due to {}".format(file_name, str(err)))
            raise Exception("Write to gcp failed!")


gcp_helper = GcpHelper('bucket_name')
voucher_html = open('voucher_test.html').read()
#some operations on voucher_html string here
gcp_helper.put_data_in_bucket("booking/voucher.pdf", voucher_html)

I want to upload the string directly, rather than saving it as a PDF file first and then uploading that file. (If nothing else works, I will have to do that.)

But of course this didn't work: the uploaded PDF file was corrupted. I was hoping that blob.upload_from_string would take care of any formatting/encoding that is required, but it seems it doesn't. ;)

Solution

You could write your PDF to a temporary file on disk and then upload that file to Cloud Storage:

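from tempfile import NamedTemporaryFile


# `blob` is the target blob, e.g. bucket.blob("booking/voucher.pdf")
with NamedTemporaryFile(mode='w+b') as temp:
    # data must be bytes: the file that came from the request
    temp.write(data)
    # Do not close the file here: closing deletes the temporary file.
    # Flush and rewind instead, then upload from the same handle.
    temp.flush()
    temp.seek(0)
    blob.upload_from_file(temp, content_type='application/pdf')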

GCS is never going to convert your HTML to a PDF file

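Converting HTML to PDF is always a difficult task, and there is no way to automate it with Cloud Storage. blob.upload_from_string stores the bytes you pass unchanged, so the conversion has to happen in your application before the upload.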
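
A minimal sketch of converting first and uploading afterwards, without an intermediate file. It assumes pdfkit and the wkhtmltopdf binary are installed; the function names html_to_pdf_bytes and upload_html_as_pdf are hypothetical, and blob is expected to be a google-cloud-storage blob such as self.bucket.blob(file_name) from the question.

```python
def html_to_pdf_bytes(html):
    """Render an HTML string to PDF bytes with pdfkit (assumes wkhtmltopdf is installed)."""
    import pdfkit  # imported lazily so this module loads even without pdfkit

    # Passing False as the output path makes pdfkit return the PDF as bytes
    return pdfkit.from_string(html, False)


def upload_html_as_pdf(blob, html, render=html_to_pdf_bytes):
    """Convert html to PDF bytes, then upload them to the given blob.

    `render` is a hypothetical hook so the converter can be swapped or faked.
    Returns the number of bytes uploaded.
    """
    pdf_bytes = render(html)
    # A real PDF starts with the "%PDF" header; raw HTML does not
    if not isinstance(pdf_bytes, bytes) or not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("renderer did not return PDF data")
    blob.upload_from_string(pdf_bytes, content_type="application/pdf")
    return len(pdf_bytes)
```

With the helper from the question this would be called as upload_html_as_pdf(gcp_helper.bucket.blob("booking/voucher.pdf"), voucher_html). The header check is only a cheap guard against uploading the HTML itself by mistake.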

To work with pdfkit and avoid formatting problems, I recommend:

Use pure HTML5 + CSS and reduce the use of JS
Use images instead of JS graphics
Only use vanilla JS
Embed images as base64 strings, which is the fastest way to load them
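
The base64 recommendation can be sketched with the standard library alone. The helper names to_data_uri and img_tag are hypothetical, and the default MIME type of image/png is an assumption about the image format.

```python
import base64
import html


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data: URI that can be used in an <img> src."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(mime_type, encoded)


def img_tag(image_bytes, mime_type="image/png", alt=""):
    """Build an <img> tag with the image embedded inline, so no extra request is needed."""
    return '<img src="{}" alt="{}">'.format(
        to_data_uri(image_bytes, mime_type), html.escape(alt, quote=True)
    )
```

Because the image travels inside the HTML string, the converter does not have to fetch it from disk or over the network while rendering.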

In past projects I used this strategy:

Create the images with PhantomJS, because I had a lot of beautiful charts that were drawn with JS
Create an HTML file on the backend with all the information and with the images embedded as base64
Use Celery to create a task queue, because creating a PDF took about 30 seconds: every report had 500 to 1,000 pages
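
The idea behind the last step, moving slow PDF creation out of the request, can be sketched as follows. This is not Celery: it uses the standard library's ThreadPoolExecutor as a stand-in so the example stays self-contained, and the names build_report_pdf and enqueue_report are hypothetical. With Celery, the function would instead be registered as a task and run by a separate worker process.

```python
from concurrent.futures import ThreadPoolExecutor

# Small worker pool standing in for a real task queue
_executor = ThreadPoolExecutor(max_workers=2)


def build_report_pdf(html, render):
    """Slow step: turn the prepared HTML into PDF bytes using the given renderer."""
    return render(html)


def enqueue_report(html, render):
    """Schedule PDF creation in the background and return a Future immediately."""
    return _executor.submit(build_report_pdf, html, render)
```

The caller gets a Future back right away and can fetch the bytes later with future.result(), or attach a callback that uploads the PDF once it is ready.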

I found a similar approach in this GitHub file:

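# Excerpt: to_pdf is a method of an invoice model class.
# It relies on these module-level imports:
import os
import uuid

import pdfkit
from django.conf import settings
from django.contrib.sites.models import Site
from django.template.loader import get_template


def to_pdf(self):
    # Render the invoice template to an HTML string
    template = get_template('{template}/{template}.html'.format(template=self.html_template))
    invoice = template.render({
        'site': Site.objects.get_current(),
        'invoice': self,
        'users': (
            ('provider', self.provider),
            ('client', self.client),
        ),
        'line_items': self.aggregate_line_items(),
        'currency': self.hourly_rate.hourly_rate_currency
    })
    # Write the PDF to a unique path on disk and return that path
    self.pdf_path = os.path.join(settings.INVOICE_PDF_PATH, '{}.pdf'.format(uuid.uuid4()))
    pdf_configuration = pdfkit.configuration(wkhtmltopdf=settings.HTML_TO_PDF_BINARY_PATH)
    pdfkit.from_string(invoice, self.pdf_path, configuration=pdf_configuration, options=self.PDF_OPTIONS)
    return self.pdf_path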