python pypdf

Split a PDF into multiple PDFs of different page lengths using Python


I have a single PDF with 350 pages that contains multiple electricity bills. The bills are not all the same length: some have just 1 page, others have 2 or 3. I need to split this PDF so that each bill ends up in its own file.

I have the following code for splitting a PDF into single pages:

from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("80....pdf", "rb"))

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("80...-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)

I have identified the first page of each bill by searching the text extracted with PyPDF2 for a regex. This is my code:

import PyPDF2
import re

object = PyPDF2.PdfFileReader("PDF.pdf")

NumPages = object.getNumPages()

for i in range(0, NumPages):
    PageObj = object.getPage(i)

    Text = PageObj.extractText() 
    #print(Text)
    if re.search(r"Bill of Supply for Electricity", Text):
        print("this is page " + str(i) + '\n First Page') 
        Regex = re.search(r"Bill of Supply for Electricity", Text).group()
        print(Regex)
    else:
        print("this is page " + str(i) + '\n Not First Page')

I have found the pages on which this string occurs. Now I want to split the PDF so that a new file starts only when the regex 'Bill of Supply for Electricity' matches again. For example, if the 1st page matches and the 3rd page matches again, then pages 1 and 2 should make one PDF and page 3 should start another. If the 4th page matches as well, then page 3 becomes a separate PDF on its own, and page 4 onwards goes into the next PDF until the regex matches again, and so on. How do I go about this?


Solution

  • Alright, I've changed some of your variable names and removed the print statements. Let's start by building a function that tells you where the page breaks need to be.

    import re
    import PyPDF2

    def getPagebreakList(file_name: str)->list:
        # Return the 0-based indices of the pages on which a new bill starts.
        pdf_file = PyPDF2.PdfFileReader(file_name)
        num_pages = pdf_file.getNumPages()
        page_breaks = list()
    
        for i in range(0, num_pages):
            page = pdf_file.getPage(i)
            text = page.extractText()
    
            # The heading only appears on the first page of each bill.
            if re.search(r"Bill of Supply for Electricity", text):
                page_breaks.append(i)
    
        return page_breaks
    

    Next we're going to pop elements from the beginning of that page_breaks list and use them as we move through the PDF file.

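    from PyPDF2 import PdfFileWriter, PdfFileReader
    
    inputpdf = PdfFileReader(open("80....pdf", "rb"))
    num_pages = inputpdf.numPages
    # Scan the same file that is being split.
    page_breaks = getPagebreakList("80....pdf")
    
    # Page 0 always opens the first output file, so a break there is redundant.
    if page_breaks and page_breaks[0] == 0:
        page_breaks.pop(0)
    
    i = 0
    while (i < num_pages):
        # The first page of the next bill is the (exclusive) end of this one;
        # the last bill runs to the end of the document.
        if page_breaks:
            page_break = page_breaks.pop(0)
        else:
            page_break = num_pages
        start = i
        output = PdfFileWriter()
        while (i < page_break):
            output.addPage(inputpdf.getPage(i))
            i = i + 1
        # Name each file after the 0-based index of its first page.
        with open("80...-page%s.pdf" % start, "wb") as outputStream:
            output.write(outputStream)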
    

    Hopefully this works. I obviously have no way of testing as I don't happen to have a long PDF with a regex on some pages handy.
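
The splitting rule from the question (a new file begins on every page that matches the regex) can also be written as a small pure function that is easy to check without a PDF at hand. The following is only a sketch under a few assumptions: `bill_ranges` and `split_pdf` are hypothetical names, page indices are 0-based, and `split_pdf` assumes the newer `pypdf` package is installed (as far as I know, the camelCase names such as `PdfFileReader` and `getPage` were removed in PyPDF2 3.0 in favour of `PdfReader`, `reader.pages`, and `add_page`).

```python
import re
from typing import List, Tuple

# The heading that marks the first page of each bill (taken from the question).
MARKER = re.compile(r"Bill of Supply for Electricity")


def bill_ranges(starts: List[int], num_pages: int) -> List[Tuple[int, int]]:
    """Turn first-page indices into half-open (start, end) page ranges.

    Hypothetical helper: pages before the first marker, if any,
    are grouped into a leading chunk of their own.
    """
    bounds = sorted(set(starts) | {0, num_pages})
    bounds = [b for b in bounds if 0 <= b <= num_pages]
    # Each pair of neighbouring boundaries delimits one output file.
    return list(zip(bounds, bounds[1:]))


def split_pdf(path: str, out_pattern: str) -> List[str]:
    """Write one PDF per bill and return the file names written.

    out_pattern must contain a single %d placeholder, which is filled
    with the 0-based index of the first page of each bill.
    """
    # Imported here so bill_ranges can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    # extract_text() may yield an empty result on image-only pages.
    starts = [
        i for i, page in enumerate(reader.pages)
        if MARKER.search(page.extract_text() or "")
    ]

    written = []
    for start, end in bill_ranges(starts, len(reader.pages)):
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        out_name = out_pattern % start
        with open(out_name, "wb") as fh:
            writer.write(fh)
        written.append(out_name)
    return written
```

A call might look like `split_pdf("bills.pdf", "bill-from-page%d.pdf")`, where both file names are placeholders. Keeping the range computation separate from the file handling means the grouping logic can be verified on plain lists of integers before any PDF is touched.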