I have a single PDF with 350 pages that contains multiple electricity bills. The bills are not all the same length: some have just 1 page, others have 2 or 3 pages. I need to split this PDF accordingly.
I have the following code for splitting a PDF into single pages:
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("80....pdf", "rb"))
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("80...-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)
I have also found a regex that marks the first page of each bill by searching the text of every page with PyPDF2. Here is my code:
import PyPDF2
import re

object = PyPDF2.PdfFileReader("PDF.pdf")
NumPages = object.getNumPages()
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    Text = PageObj.extractText()
    #print(Text)
    if re.search(r"Bill of Supply for Electricity", Text):
        print("this is page " + str(i) + '\n First Page')
        Regex = re.search(r"Bill of Supply for Electricity", Text).group()
        print(Regex)
    else:
        print("this is page " + str(i) + '\n Not First Page')
This tells me on which pages the string appears. Now I want to split the PDF so that a new file starts only when the regex 'Bill of Supply for Electricity' is found again. For example, if the 1st page matches and the 3rd page matches again, then pages 1 and 2 should make one PDF and a new PDF should start at page 3. If the 4th page matches as well, then page 3 should be a separate PDF on its own, and page 4 onwards should form another PDF until the regex appears again, and so on. How do I go about this?
Alright, I've changed some of your variable names and removed the print statements. Let's start by building a function that tells you where the page breaks need to be.
import re
import PyPDF2

def getPagebreakList(file_name: str) -> list:
    pdf_file = PyPDF2.PdfFileReader(file_name)
    num_pages = pdf_file.getNumPages()
    page_breaks = list()
    for i in range(0, num_pages):
        page = pdf_file.getPage(i)
        text = page.extractText()
        # A match means page i is the first page of a bill.
        if re.search(r"Bill of Supply for Electricity", text):
            page_breaks.append(i)
    return page_breaks
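The detection step can also be tried without a PDF by running the same regex over a list of page texts. This is only a minimal sketch, assuming the text of each page has already been extracted into a string; `find_first_pages`, `BILL_START` and the sample texts are hypothetical names made up for illustration.

```python
import re

# Same pattern as in the question, compiled once for reuse.
BILL_START = re.compile(r"Bill of Supply for Electricity")

def find_first_pages(page_texts):
    """Return the 0-based indices of pages whose text matches BILL_START."""
    return [i for i, text in enumerate(page_texts) if BILL_START.search(text)]

# Hypothetical page texts standing in for the extracted text of each page.
sample_texts = [
    "Bill of Supply for Electricity - account A",
    "continuation of account A",
    "Bill of Supply for Electricity - account B",
    "Bill of Supply for Electricity - account C",
]
first_pages = find_first_pages(sample_texts)
print(first_pages)  # [0, 2, 3]
```

Keeping the matching separate from the PDF reading makes it easy to check the pattern against a few pages of real extracted text before running it over all 350 pages.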
Next we're going to pop elements from the beginning of that page_breaks
list and use them as we move through the PDF file.
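from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("80....pdf", "rb"))
num_pages = inputpdf.numPages
# Pass the same file that was opened above.
page_breaks = getPagebreakList('yourPDF.pdf')
# Each entry is the first page of a bill, so a bill ends just before the next entry.
# An entry for page 0 starts the first bill and does not end anything, so drop it.
if page_breaks and page_breaks[0] == 0:
    page_breaks.pop(0)
i = 0
while i < num_pages:
    if page_breaks:
        page_break = page_breaks.pop(0)
    else:
        # No more matches: the last bill runs to the end of the file.
        page_break = num_pages
    output = PdfFileWriter()
    # Copy pages up to, but not including, the first page of the next bill.
    while i < page_break:
        output.addPage(inputpdf.getPage(i))
        i = i + 1
    with open("80...-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)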
Hopefully this works. I obviously have no way of testing as I don't happen to have a long PDF with a regex on some pages handy.
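One way to check the splitting logic without a real PDF is to pull the range calculation into a plain function and try it on a hand-written list of first pages. The sketch below is an illustration under that assumption, not a tested replacement for the loop above; `bill_page_ranges` is a hypothetical helper name, and the ranges are 0-based with an exclusive end.

```python
def bill_page_ranges(first_pages, num_pages):
    """Turn first-page indices into (start, end) page ranges, end exclusive."""
    starts = sorted(set(first_pages))
    # Pages before the first match (if any) form their own leading chunk.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    # Each chunk ends where the next one starts; the last ends at num_pages.
    ends = starts[1:] + [num_pages]
    return [(s, e) for s, e in zip(starts, ends) if s < e]

# Example from the question: matches on pages 1, 3 and 4 (0-based: 0, 2, 3).
ranges = bill_page_ranges([0, 2, 3], 5)
print(ranges)  # [(0, 2), (2, 3), (3, 5)]
```

Each `(start, end)` pair can then drive the writing step: add pages `start` through `end - 1` to a fresh writer and save it as one file.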