python matplotlib pypdf

Crop a PDF with PyPDF2


I've been working on a project in which I extract table data from a PDF with a neural network. I successfully detect tables and get their coordinates (x, y, width, height). I've been trying to crop the PDF with PyPDF2 to isolate the table, but for some reason the crop never matches the desired outcome. After running inference I get these coordinates:

[[5.0948269e+01, 1.5970685e+02, 1.1579385e+03, 2.7092386e+02, 9.9353129e-01]]

The fifth number is my neural network's precision; we can safely ignore it.

Plotting them with pyplot works, so there's no problem with the coordinates themselves: [image: Matplot]

However, using the same coordinates in PyPDF2 always gives a crop that is off:

from PyPDF2 import PdfFileWriter, PdfFileReader

with open("mypdf.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()

    numPages = input1.getNumPages()

    for i in range(numPages):
        page = input1.getPage(i)
        page.cropBox.upperLeft = (5.0948269e+01,1.5970685e+02)
        page.cropBox.upperLeft = (1.1579385e+03, 2.7092386e+02)
     
        
        output.addPage(page)
        with open("out.pdf", "wb") as out_f:
          output.write(out_f)

This is the output I get: [image: Cropped PDF]

Am I missing something?

Thank you!


Solution

  • Here you go:

    from PyPDF2 import PdfFileWriter, PdfFileReader
    
    with open("mypdf.pdf", "rb") as in_f:
        input1 = PdfFileReader(in_f)
        output = PdfFileWriter()
    
        numPages = input1.getNumPages()
    
        x, y, w, h = (5.0948269e+01, 1.5970685e+02, 1.1579385e+03, 2.7092386e+02)
    
        page_x, page_y = input1.getPage(0).cropBox.getUpperLeft()
        upperLeft = [page_x.as_numeric(), page_y.as_numeric()] # convert PyPDF2.FloatObjects into floats
        new_upperLeft  = (upperLeft[0] + x, upperLeft[1] - y)
        new_lowerRight = (new_upperLeft[0] + w, new_upperLeft[1] - h)
    
        for i in range(numPages):
            page = input1.getPage(i)
            page.cropBox.upperLeft  = new_upperLeft
            page.cropBox.lowerRight = new_lowerRight
    
            output.addPage(page)
    
        with open("out.pdf", "wb") as out_f:
            output.write(out_f)
    

    Note: in PyPDF2 the coordinate origin is at the lower-left corner of the page, and the Y-axis points from the bottom up, unlike on a screen. So to get the PDF y-coordinate of the top edge of your crop area, subtract the top edge's y-coordinate (measured down from the top of the page) from the height of the page.
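The conversion described in the note can be sketched as a small stand-alone function that does not depend on PyPDF2. The name `image_box_to_pdf_corners` and its parameters are hypothetical, introduced only for illustration. The sketch assumes the detected box is given as (x, y, width, height) with a top-left origin, and that it is already in the same units as the page (PDF points); if the coordinates come from a rendered image, they would first need to be scaled to points.

```python
def image_box_to_pdf_corners(x, y, w, h, page_left, page_top):
    """Convert a top-left-origin box (x, y, w, h) into PDF crop corners.

    page_left and page_top are the PDF coordinates of the page's
    upper-left corner (for example, taken from the existing crop box).
    Returns (upper_left, lower_right) in PDF coordinates, where the
    origin is at the lower left and the Y-axis points up.
    """
    # Moving right on screen is also moving right in PDF space,
    # but moving down on screen means a smaller PDF y value.
    upper_left = (page_left + x, page_top - y)
    lower_right = (upper_left[0] + w, upper_left[1] - h)
    return upper_left, lower_right


# Example: a 612 x 792 point page whose upper-left corner is at (0, 792).
upper_left, lower_right = image_box_to_pdf_corners(
    50, 100, 200, 150, page_left=0, page_top=792
)
print(upper_left)   # (50, 692)
print(lower_right)  # (250, 542)
```

The two returned tuples correspond to the values assigned to `page.cropBox.upperLeft` and `page.cropBox.lowerRight` in the answer's code.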
