Search code examples
pythonpdfpdfplumber

Extracting tables from a PDF with empty cells and no visible edges


I am using pdfplumber to extract data from the following PDF page:

Data in the test pdf file

import pdfplumber
 

pdf_file = 'D:/Input/Book1.pdf'
pdf = pdfplumber.open(pdf_file)

page = pdf.pages[0]
text = page.extract_text()
table  = page.extract_tables()

for line in text.split("\n"):
    print(line)

When I use page.extract_tables(), I only get the row headers, not the associated data in the table.

Since extract_tables() isn't working, I am using page.extract_text() to loop through it line by line. However, extract_text() seems to omit empty cells in the table data when reading a line.

Below data got extracted when using extract_text():

Weekly test report with multiple lines of hedder of the each page of report
col1 col2 col3 Start End Col Group
Name Name Name Date Date Col5 Col6 Col7 Currency
123 ABC 26/8/2024 26/8/2024 1000 20000 26/8/2024 USD
456 DEF New 26/8/2024 2000 15000 27/8/2024 INR
789 GES DDD 26/8/2024 26/8/2023 4000 20/4/2024 AUD

I want to create a data frame using the table data from the PDF.


Solution

  • It seems there is no direct way I can access table data using extract_tables() since my table doesn't have outlines.

    To Read the table I need to use extract_table() with table_settings.

    import os
    import pandas as pd
    import pdfplumber
    
    
    pdf_file = 'D:/Input/Book1.pdf'
    
    page_vertical_lines = [40,102,152,203,253,305,360,408,458,510]
    table_settings={"vertical_strategy": "explicit", 
                    "horizontal_strategy": "text"
                    ,"explicit_vertical_lines": page_vertical_lines,
                    }
    pdf = pdfplumber.open(pdf_file)
    df1 = pd.DataFrame(pdf.pages[0].extract_table(table_settings))
    
    

    This code Extracted the required number of columns from the pdf file.

    Challenge here is getting explicit_vertical_lines position

    I have used matplotlib library to identity the correct postion with below code. you need to adjust the postion according to your report columns.

    import pdfplumber
    import matplotlib.pyplot as plt
    
    pdf_file = 'D:/Input/Book1.pdf'
    
    with pdfplumber.open(pdf_file) as pdf:
        page = pdf.pages[0]
        im = page.to_image()
        
        vertical_lines = [40,102,152,203,253,305,360,408,458,510]
        for x in vertical_lines:
            im.draw_vline(x,stroke="blue",stroke_width=1)
        
        plt.imshow(im.annotated)
        plt.show()
    
    

    when you run this code it gives blue lines as per vertical_lines values on the pdf as shown below in the picture. Adjust the vertical_lines values as per your need.

    PDF page with blue lines