I have a single PDF and would like to create a separate PDF for each of its pages. How can I do this without downloading anything locally? I know that Document AI has a file-splitting module (which would actually identify the different files, and that would be ideal), but it is not publicly available.
I am currently using PyPDF2 to do this:
list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
print(len(list_of_blobs))
list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)

inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))
individual_files = []
stream = io.StringIO()

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    individual_files.append(output)
    with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
        outputStream.write(stream.getvalue())
        #print(outputStream.read())
        with open(outputStream.name, 'rb') as f:
            data = f.seek(85)
            data = f.read()
            individual_files.append(data)
            bucket.blob('processed/' + "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
In the output, I see different PyPDF2 objects such as
<PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0>
but I have no idea how I should proceed next.
There were two reasons why my program was not working, one of which was that the second with open(...) block needed to be outside of the first one. Below is the corrected code:
if inputpdf.numPages > 2:
    for i in range(inputpdf.numPages):
        # Build a one-page PDF from page i
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        # Write it to /tmp in binary mode; the file is closed when this block ends
        with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
            output.write(outputStream)
        # Reopen the closed file and upload its bytes
        with open(outputStream.name, 'rb') as f:
            data = f.read()
            #print(data)
            bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
        stream.truncate(0)