I have a multi-page PDF with a table at the top of each page that I want to get rid of, so I am cropping each page below that top table.
What I don't know is how to combine the cropped pages and save them as a single PDF.
I have tried the following:
import pandas as pd
import pdfplumber

path = r"file-tests.pdf"
with pdfplumber.open(path) as pdf:
    pages = pdf.pages
    # loop over each page
    for p in pages:
        print(p)
        # bbox is the table's bounding box in (x0, top, x1, bottom) format
        bbox_vals = p.find_tables()[0].bbox
        # take the bottom edge of the first table to keep the part of the page below it
        y0_top_table = bbox_vals[3]
        print(y0_top_table)
        # crop from the left to the right edge, and from the table's bottom to the bottom of the page
        p.crop((0, y0_top_table, 590, 840))
Output:
<Page:1>
269.64727650000003
<Page:2>
269.64727650000003
<Page:3>
269.64727650000003
<Page:4>
269.64727650000003
<Page:5>
269.64727650000003
<Page:6>
269.64727650000003
<Page:7>
269.64727650000003
<Page:8>
269.64727650000003
<Page:9>
269.64727650000003
<Page:10>
269.64727650000003
<Page:11>
269.64727650000003
<Page:12>
269.64727650000003
<Page:13>
269.64727650000003
<Page:14>
269.64727650000003
<Page:15>
269.64727650000003
<Page:16>
269.64727650000003
<Page:17>
269.64727650000003
<Page:18>
269.64727650000003
<Page:19>
269.64727650000003
<Page:20>
269.64727650000003
How do I append these cropped pages and save them into one PDF?
Update:
It seems it's not possible to write or save a PDF file using pdfplumber, as per this discussion link.
(Not sure why this question was downvoted. Whoever did that should also provide an answer or a link to where this is already answered.)
Update2:
from pdfrw import PdfWriter

output_pdf = PdfWriter()
with pdfplumber.open(path) as pdf:
    pages = pdf.pages
    for p in pages:
        print(p)
        bbox_vals = p.find_tables()[0].bbox
        y0_top_table = bbox_vals[3]
        print(y0_top_table)
        cropped_pdf = p.crop((0, y0_top_table, 590, 840))
        print(type(cropped_pdf))
        output_pdf.addpage(cropped_pdf)
output_pdf.write(r"tests_cropped_file.pdf")
Output & Error:
<Page:1>
269.64727650000003
<class 'pdfplumber.page.CroppedPage'>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[219], line 13
11 cropped_pdf = p.crop((0, y0_top_table, 590, 840))
12 print(type(cropped_pdf))
---> 13 output_pdf.addpage(cropped_pdf)
File c:\Users\vinee\anaconda3\envs\llma_py_3_12\Lib\site-packages\pdfrw\pdfwriter.py:270, in PdfWriter.addpage(self, page)
268 def addpage(self, page):
269 self._trailer = None
--> 270 if page.Type != PdfName.Page:
271 raise PdfOutputError('Bad /Type: Expected %s, found %s'
272 % (PdfName.Page, page.Type))
273 inheritable = page.inheritable # searches for resources
AttributeError: 'CroppedPage' object has no attribute 'Type'
Update 3:
It seems this issue of cropping a PDF and saving it was also raised in 2018, but it had no solution as per this discussion link.
If anyone knows a workaround, please let me know. I would really appreciate it!
Versions: pdfplumber 0.11.4, pillow 9.5.0
Actually, it is possible to crop pages and save them as a PDF with pdfplumber, but only if you don't need further data extraction.
Let's say you want to supply someone with a depersonalized medical document for visual reference, and no further processing of the data is expected. In this case, you could crop the pages and save them as images in a PDF as follows (note that in your sample document, the personal info is located within the first rectangle on a page):
import pdfplumber

source_path = '.../sample_report.pdf'
destination_path = 'data.pdf'

pdf = pdfplumber.open(source_path)
cropped_pages = []
for page in pdf.pages:
    x0, x1 = 0, page.width
    # keep everything below the first rectangle on the page
    y0, y1 = page.rects[0]['bottom'], page.height
    cropped_pages.append(page.crop([x0, y0, x1, y1])
                         .to_image(resolution=400)
                         .annotated)

# save the first image as a PDF and append the remaining images as extra pages
cropped_pages[0].save(destination_path,
                      save_all=True,
                      append_images=cropped_pages[1:])
It can be done because page.to_image().annotated is a Pillow Image object, which in turn can be saved as a PDF with additional images passed as the append_images parameter (save_all=True is required in this case).
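To see the Pillow part in isolation, here is a minimal sketch that does not involve pdfplumber at all. The helper name save_images_as_pdf, the blank stand-in images and the temporary output path are made up for illustration. The conversion to RGB is a precaution I added, on the assumption that Pillow's PDF writer does not accept every image mode (RGBA, for example).

```python
import os
import tempfile

from PIL import Image


def save_images_as_pdf(images, destination_path):
    """Write a list of Pillow images to one PDF, one image per page.

    Hypothetical helper for illustration; not part of pdfplumber or Pillow.
    """
    if not images:
        raise ValueError("need at least one image to write a PDF")
    # Precaution (assumption): normalise every image to RGB before saving.
    rgb_images = [img.convert("RGB") for img in images]
    first, rest = rgb_images[0], rgb_images[1:]
    # save_all=True tells Pillow to write a multi-page file;
    # append_images supplies the pages that follow the first one.
    first.save(destination_path, format="PDF", save_all=True, append_images=rest)


# Stand-ins for the cropped page images: three blank pages in different greys.
pages = [
    Image.new("RGB", (200, 300), shade)
    for shade in ((255, 255, 255), (200, 200, 200), (128, 128, 128))
]
out_path = os.path.join(tempfile.mkdtemp(), "combined.pdf")
save_images_as_pdf(pages, out_path)
```

With real data, the list passed in would be the cropped_pages list built in the loop above. As with the original snippet, the result is a PDF of raster images, so the text in it can no longer be selected or extracted.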