PDF manipulation with Python


I need to remove the last page of a PDF file. I have multiple PDF files in the same directory. So far I have the following code:

from PyPDF2 import PdfFileWriter, PdfFileReader
import os

def changefile (file):

    infile = PdfFileReader(file, "rb")
    output = PdfFileWriter()
    numpages = infile.getNumPages()

    for i in range (numpages -1):
        p = infile.getPage(i)
        output.addPage(p)

    with open(file, 'wb') as f:
        output.write(f)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.endswith(".pdf") or file.endswith(".PDF"):
        changefile(file)

The script worked in testing. I need it for work: each day I download several e-invoices from our main external supplier. The last page always lists the conditions of sale and is useless. Unfortunately, our supplier adds a signature to each file, which causes my script to fail.

When I run it on the invoices, I get the following error:

line 1901, in read
    raise utils.PdfReadError("Could not find xref table at specified location")
PyPDF2.utils.PdfReadError: Could not find xref table at specified location

I was able to fix it on my Linux laptop by running qpdf invoice.pdf invoice-fix. I can't install QPDF at work, where we use Windows.

I suppose this error is triggered by the signature left by our supplier on each PDF file.

Does anyone know how to fix this error? I am looking for an efficient method to fix the issue with a broken PDF file and its signature. There must be something better than opening each PDF file with Adobe and removing the signature manually ...

Automation would be nice, because I put multiple invoices in the same directory every day.

Thanks.


Solution

  • The problem is probably a corrupt PDF file, and QPDF is indeed able to work around that. For that reason I would recommend using the pikepdf library, which is based on QPDF, instead of PyPDF2 in this case. First install pikepdf (for example with pip or your package manager), then try this code:

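    import os
    import pikepdf
    
    PDF_DIR = "C:\\Users\\Conan\\PycharmProjects\\untitled"
    
    def changefile(path):
        print("Processing {0}".format(path))
        # pikepdf (via QPDF) tries to recover a damaged xref table when opening
        with pikepdf.Pdf.open(path) as pdf:
            # p= is a 1-based page number, so len(pdf.pages) is the last page
            pdf.pages.remove(p=len(pdf.pages))
            # Save to a temporary file, because the input is still open here
            pdf.save(path + '.tmp')
        # Replace the original file with the trimmed copy
        os.replace(path + '.tmp', path)
    
    for name in os.listdir(PDF_DIR):
        if name.lower().endswith(".pdf"):
            # os.listdir returns bare file names, so join them with the directory
            changefile(os.path.join(PDF_DIR, name))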
    

    Link to pikepdf docs: https://pikepdf.readthedocs.io/en/latest/

    Let me know if that works for you.
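
If the script is going to run every day, it may be worth keeping the original invoice untouched whenever a file cannot be processed. The following is only a minimal sketch of that idea, not a tested production script: the helper names `iter_pdfs`, `rewrite_in_place` and `drop_last_page` are made up for illustration, and the guard that skips single-page files is my own assumption about what you would want.

```python
import os
import tempfile


def iter_pdfs(directory):
    """Yield full paths of the PDF files directly inside `directory`."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.lower().endswith(".pdf") and os.path.isfile(path):
            yield path


def rewrite_in_place(path, transform):
    """Run transform(src, dst), then move dst over src only on success."""
    folder = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so the final move stays on one drive
    fd, tmp = tempfile.mkstemp(suffix=".tmp", dir=folder)
    os.close(fd)
    try:
        transform(path, tmp)
        os.replace(tmp, path)
    except Exception:
        # Leave the original file untouched if anything went wrong
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def drop_last_page(src, dst):
    """Write a copy of src without its last page (needs pikepdf installed)."""
    import pikepdf  # imported lazily so the helpers above work without it

    with pikepdf.open(src) as pdf:
        # Assumed guard: do not try to write a PDF with zero pages
        if len(pdf.pages) > 1:
            pdf.pages.remove(p=len(pdf.pages))  # p= is a 1-based page number
        pdf.save(dst)
```

With these helpers, the daily run would be a loop over `iter_pdfs("C:\\Users\\Conan\\PycharmProjects\\untitled")` that calls `rewrite_in_place(path, drop_last_page)` for each path. Wrapping that call in a try/except would let one unreadable invoice be reported without stopping the rest.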