
Crop a Page in Python Using pyPdf


I am writing a script to extract some data from a PDF. The PDF is fairly complicated because it has multiple columns, so I decided to crop each column and concatenate the crops into a new PDF that is easier to parse with pyPdf. This is my code:

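from pyPdf import PdfFileReader, PdfFileWriter

# Assumed setup (not shown in the original post): "in.pdf" is a placeholder name.
# The file is read three times so each region gets its own copy of every page.
input1 = PdfFileReader(open("in.pdf", "rb"))
input2 = PdfFileReader(open("in.pdf", "rb"))
input3 = PdfFileReader(open("in.pdf", "rb"))
output = PdfFileWriter()
numPages = input1.getNumPages()

for i in range(numPages):
    # Top region: y from 550 to 842
    page1 = input1.getPage(i)
    page1.trimBox.lowerLeft = (0, 550)
    page1.trimBox.upperRight = (480, 842)
    page1.cropBox.lowerLeft = (0, 550)
    page1.cropBox.upperRight = (480, 842)
    output.addPage(page1)
    # Middle region: y from 280 to 550
    page2 = input2.getPage(i)
    # Debug output: the media box of page1 is not changed by the crop above
    print("%s %s" % (page1.mediaBox.getUpperRight_x(), page1.mediaBox.getUpperRight_y()))
    page2.trimBox.lowerLeft = (0, 280)
    page2.trimBox.upperRight = (480, 550)
    page2.cropBox.lowerLeft = (0, 280)
    page2.cropBox.upperRight = (480, 550)
    output.addPage(page2)
    # Bottom region: y from 0 to 280
    page3 = input3.getPage(i)
    page3.trimBox.lowerLeft = (0, 0)
    page3.trimBox.upperRight = (480, 280)
    page3.cropBox.lowerLeft = (0, 0)
    page3.cropBox.upperRight = (480, 280)
    output.addPage(page3)

outputStream = open("out.pdf", "wb")
output.write(outputStream)
outputStream.close()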

I then sent this PDF to a PHP server to parse it and extract the text. Unexpectedly, that did not help: cropBox only changes the visible region of the page. The rest of the content is still in the file; it is just not displayed. When I processed the new PDF with PHP, I got the same results as before. My question is: is there a way to make cropBox actually crop the page and discard everything outside the box?
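
To illustrate why the text is still there, here is a toy model of a page, not the pyPdf API. It assumes only that a page's boxes are stored as entries alongside the content stream rather than inside it. The dictionary layout, the media box size, and the content strings are made up for this sketch.

```python
# Toy model of a PDF page using plain dicts. This is NOT pyPdf code;
# the media box size and the content operators are illustrative only.
def make_page():
    return {
        "/MediaBox": [0, 0, 595, 842],
        "/Contents": ["(top band) Tj", "(middle band) Tj", "(bottom band) Tj"],
    }


def set_crop_box(page, lower_left, upper_right):
    """Record the visible rectangle; the content stream is left untouched."""
    page["/CropBox"] = [lower_left[0], lower_left[1],
                        upper_right[0], upper_right[1]]
    return page


page = make_page()
before = list(page["/Contents"])
set_crop_box(page, (0, 550), (480, 842))

print(page["/CropBox"])
print(page["/Contents"] == before)
```

A viewer honours the crop entry when drawing the page, but a text extractor that reads the content stream still sees all three bands, which matches the behaviour described above.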


Solution

  • I tried several other Python libraries, but none of them helped. Later, I stumbled upon PDFBox, which proved to be extremely useful and much better than PDFMiner and pyPdf at text extraction. I could extract the text inside a rectangle specified by its x and y position, width, and height. Its only drawback was that I could not find a Python wrapper for it, so I had to write the application in Java.
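
The idea behind extracting text by rectangle can be sketched in a few lines of Python. This is not PDFBox code, and the answer does not say which PDFBox call was used (PDFBox offers region extraction through its PDFTextStripperByArea class). The sketch assumes the text has already been split into positioned fragments. The `Fragment` and `Rect` types, the `text_in_rect` helper, and the sample data are hypothetical.

```python
from collections import namedtuple

# Hypothetical positioned text fragment: (x, y) is its origin in points,
# with y growing upward as in PDF user space.
Fragment = namedtuple("Fragment", "x y text")
Rect = namedtuple("Rect", "x y width height")


def text_in_rect(fragments, rect):
    """Join the text of all fragments whose origin lies inside rect."""
    inside = [
        f for f in fragments
        if rect.x <= f.x < rect.x + rect.width
        and rect.y <= f.y < rect.y + rect.height
    ]
    # Reading order: top to bottom (descending y), then left to right.
    inside.sort(key=lambda f: (-f.y, f.x))
    return " ".join(f.text for f in inside)


# Made-up fragments, one or two per band of the page.
fragments = [
    Fragment(120, 800, "band"),
    Fragment(40, 800, "top"),
    Fragment(40, 400, "middle"),
    Fragment(40, 100, "bottom"),
]

top = text_in_rect(fragments, Rect(0, 550, 480, 292))
middle = text_in_rect(fragments, Rect(0, 280, 480, 270))
print(top)
print(middle)
```

Unlike setting cropBox, this approach selects the text itself, so content outside the rectangle never reaches the parser.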