Tags: python, python-3.x, pypdf

Extracting a PDF Page With pypdf Consistently Creates PDF Without Any Pages


I am trying to programmatically split a PDF containing multiple articles into one PDF per article. Reading the file and extracting a page appear to work, and the output file is created, but it is only 311 bytes. According to Adobe Reader, it seems to contain PDF header information but no pages.

I created a new one-page PDF of about 132 KB and a simple test program. The length of the extracted text looks correct, but the output PDF is again only 311 bytes.

from pypdf import PdfReader, PdfWriter
input_pdf = PdfReader('testpdf.pdf')
page = input_pdf.pages[0]
print(len(page.extract_text()))
output = PdfWriter()
output.add_page = page
with open('testpdf_1.pdf', 'wb') as output_stream:
    output.write(output_stream)

If I run the code in a Python interactive session, I see:

(False, <_io.BufferedWriter name='testpdf_1.pdf'>)

I am not sure whether this is an error; I have not been able to find out what the message means.

I am running pypdf 5.0.1 and Python 3.8.0 in a venv.


Solution

  • Based on your requirement, I believe the page should not be attached to the PdfWriter object the way you did. Call the add_page method directly: output.add_page(page). The assignment output.add_page = page does not add the page; it replaces the method on that writer with the page object, so the writer still has no pages when it is written. That is what causes the issue. Please try the code below and let me know if it works; otherwise we can try another approach.

    from pypdf import PdfReader, PdfWriter
    input_pdf = PdfReader('testpdf.pdf')
    page = input_pdf.pages[0]
    print(len(page.extract_text()))
    output = PdfWriter()
    output.add_page(page)
    with open('testpdf_1.pdf', 'wb') as output_stream:
        output.write(output_stream)
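
To see why the assignment fails silently instead of raising an error, here is a minimal sketch of the same mistake in plain Python. TinyWriter is a hypothetical stand-in written for this illustration, not pypdf's PdfWriter, and the string "page-0" is a placeholder for a real page object. It only shows the general Python behavior: assigning to an attribute on an instance shadows the method of the same name.

```python
# TinyWriter is a hypothetical stand-in for illustration only;
# it is not part of pypdf and does not write real PDF files.
class TinyWriter:
    def __init__(self):
        self.pages = []

    def add_page(self, page):
        """Append a page to this writer and return it."""
        self.pages.append(page)
        return page


page = "page-0"  # placeholder for a real page object

broken = TinyWriter()
broken.add_page = page  # assignment: shadows the method, adds nothing
broken_count = len(broken.pages)
broken_callable = callable(broken.add_page)

fixed = TinyWriter()
fixed.add_page(page)  # call: the page is actually appended
fixed_count = len(fixed.pages)

print(broken_count, broken_callable, fixed_count)  # prints: 0 False 1
```

The broken writer ends up with zero pages and an add_page attribute that is no longer callable, which matches the symptom of a tiny output file with no pages. As for the (False, <_io.BufferedWriter ...>) line in the question: pypdf's write() returns a tuple, and an interactive session echoes return values, so that line is most likely just the return value rather than an error message.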