python google-cloud-functions google-cloud-storage pdf-generation cloud-document-ai

How can I split a PDF in Google Cloud Storage?


I have a single PDF and would like to create a separate PDF for each of its pages. How can I do so without downloading anything locally? I know that Document AI has a file-splitting module (which would actually identify the different documents inside the file, and would be ideal), but it is not publicly available.

I am currently using PyPDF2 to do this:

list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
print(len(list_of_blobs))
list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)

inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))

individual_files = []
stream = io.StringIO()

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    individual_files.append(output)
    with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
        outputStream.write(stream.getvalue())
        #print(outputStream.read())
        with open(outputStream.name, 'rb') as f:
            data = f.seek(85)
            data = f.read()
            individual_files.append(data)
            bucket.blob('processed/' +  "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')

In the output, I see different PyPDF2 objects such as <PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0> but I have no idea how I should proceed next.


Solution

  • There were two reasons why my program was not working:

    1. I was trying to read a file that had been opened in append mode (I fixed this by moving the second with open(...) block outside of the first one).
    2. I should have been writing bytes (I fixed this by changing the open mode to 'wb' instead of 'a').
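Both failure modes listed above can be reproduced with the standard library alone. The sketch below is only an illustration: the function name `demo_file_modes` and the sample payload are made up for this example and are not part of the original program.

```python
import io
import os
import tempfile


def demo_file_modes(payload=b"%PDF-1.4 example bytes"):
    """Reproduce both problems on a throwaway file, then the working fix."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page.pdf")

        # Problem 1: a file opened with "a" is write-only, so read() fails.
        with open(path, "a") as fh:
            try:
                fh.read()
            except io.UnsupportedOperation as exc:
                results["read_in_append_mode"] = type(exc).__name__

        # Problem 2: "a" is a text mode, so writing bytes is rejected.
        with open(path, "a") as fh:
            try:
                fh.write(payload)
            except TypeError as exc:
                results["bytes_in_text_mode"] = type(exc).__name__

        # Fix: write in binary mode, close the file, then reopen it for reading.
        with open(path, "wb") as fh:
            fh.write(payload)
        with open(path, "rb") as fh:
            results["round_trip"] = fh.read()
    return results
```

Calling `demo_file_modes()` returns a dict recording the exception type raised in each broken case, plus the bytes read back after the `'wb'` / `'rb'` round trip.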

    Below is the corrected code:

    # inputpdf and bucket are set up as in the question; prefix is defined elsewhere.
    if inputpdf.numPages > 2:
        for i in range(inputpdf.numPages):
            # Build a one-page PDF from page i.
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            # Write it to /tmp in binary mode ('wb'), because PDF data is bytes.
            with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
                output.write(outputStream)
            # Reopen the finished file for reading, outside the write block.
            with open(outputStream.name, 'rb') as f:
                data = f.read()
            bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')