python google-cloud-functions google-cloud-storage pdf-generation cloud-document-ai

How can I split a PDF in Google Cloud Storage?

I have a single PDF and would like to create a separate PDF for each of its pages. How can I do so without downloading anything locally? I know that Document AI has a file-splitting module (which would actually identify the different documents within the file, and that would be ideal), but it is not publicly available.

I am currently using PyPDF2 to do this:

list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
print(len(list_of_blobs))
list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)

inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))

individual_files = []
stream = io.StringIO()

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    individual_files.append(output)
    with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
        outputStream.write(stream.getvalue())
        #print(outputStream.read())
        with open(outputStream.name, 'rb') as f:
            data = f.seek(85)
            data = f.read()
            individual_files.append(data)
            bucket.blob('processed/' +  "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')

In the output, I see different PyPDF2 objects such as <PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0> but I have no idea how I should proceed next.

Solution

There were two reasons why my program was not working:

I was trying to read a file in append mode (I fixed this by moving the second with open(...) block outside of the first one).
I should have been writing bytes (I fixed this by changing the open mode to 'wb' instead of 'a').
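
The general behaviour behind both points can be reproduced with the standard library alone. This is only a sketch: the scratch file name (demo.bin) and the stand-in byte string are made up for illustration and are not real PDF data.

```python
import io
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
payload = b"%PDF-1.4 stand-in bytes"  # not a real PDF, just binary data

# 1) A handle opened in append mode ("a") is write-only, so reading fails.
with open(path, "a") as f:
    try:
        f.read()
        append_mode_readable = True
    except io.UnsupportedOperation:
        append_mode_readable = False

# 2) "a" is also a text mode, so it rejects bytes; binary data needs "wb".
with open(path, "a") as f:
    try:
        f.write(payload)
        text_mode_accepts_bytes = True
    except TypeError:
        text_mode_accepts_bytes = False

# Writing with "wb" and reading back with "rb" in a separate block
# (after the write handle is closed) round-trips the data.
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    round_trip = f.read()
```

Closing the "wb" block before reopening the file also guarantees the data has been flushed to disk by the time it is read.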

Below is the corrected code:

if inputpdf.numPages > 2:
    for i in range(inputpdf.numPages):
        # Build a one-page PDF from page i
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        # Write it to /tmp in binary mode
        local_path = "/tmp/document-page%s.pdf" % (i + 1)
        with open(local_path, "wb") as outputStream:
            output.write(outputStream)
        # Reopen only after the write handle is closed, then upload the bytes
        with open(local_path, "rb") as f:
            data = f.read()
        bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')