I am using pdfplumber
to extract data from the following PDF page:
import pdfplumber
pdf_file = 'D:/Input/Book1.pdf'
pdf = pdfplumber.open(pdf_file)
page = pdf.pages[0]
text = page.extract_text()
table = page.extract_tables()
for line in text.split("\n"):
print(line)
When I use page.extract_tables()
, I only get the row headers, not the associated data in the table.
Since extract_tables()
isn't working, I am using page.extract_text()
to loop through it line by line. However, extract_text()
seems to omit empty cells in the table data when reading a line.
Below data got extracted when using extract_text()
:
Weekly test report with multiple lines of hedder of the each page of report
col1 col2 col3 Start End Col Group
Name Name Name Date Date Col5 Col6 Col7 Currency
123 ABC 26/8/2024 26/8/2024 1000 20000 26/8/2024 USD
456 DEF New 26/8/2024 2000 15000 27/8/2024 INR
789 GES DDD 26/8/2024 26/8/2023 4000 20/4/2024 AUD
I want to create a data frame using the table data from the PDF.
It seems there is no direct way I can access table data using extract_tables()
since my table doesn't have outlines.
To Read the table I need to use
extract_table()
withtable_settings
.
import os
import pandas as pd
import pdfplumber
pdf_file = 'D:/Input/Book1.pdf'
page_vertical_lines = [40,102,152,203,253,305,360,408,458,510]
table_settings={"vertical_strategy": "explicit",
"horizontal_strategy": "text"
,"explicit_vertical_lines": page_vertical_lines,
}
pdf = pdfplumber.open(pdf_file)
df1 = pd.DataFrame(pdf.pages[0].extract_table(table_settings))
This code Extracted the required number of columns from the pdf file.
Challenge here is getting
explicit_vertical_lines
position
I have used matplotlib
library to identity the correct postion with below code. you need to adjust the postion according to your report columns.
import pdfplumber
import matplotlib.pyplot as plt
pdf_file = 'D:/Input/Book1.pdf'
with pdfplumber.open(pdf_file) as pdf:
page = pdf.pages[0]
im = page.to_image()
vertical_lines = [40,102,152,203,253,305,360,408,458,510]
for x in vertical_lines:
im.draw_vline(x,stroke="blue",stroke_width=1)
plt.imshow(im.annotated)
plt.show()
when you run this code it gives blue lines as per vertical_lines
values on the pdf as shown below in the picture. Adjust the vertical_lines
values as per your need.