Search code examples
pythonpdfpdfplumber

Scraping pdf tables including empty cells using python pdfplumber


I am using pdfplumber with python to extract data from the following pdf file

Data in the test pdf file


import pdfplumber
 

pdf_file = 'D:/Input/Book1.pdf'
pdf = pdfplumber.open(pdf_file)

page = pdf.pages[0]
text = page.extract_text()
table  = page.extract_tables()

for line in text.split("\n"):
    print(line)

when I use page.extract_tables() I only get the row headers and not associated data in the table.

Since extract_tables() not working I am using page.extract_text() to loop line by line. But, it this extract_text() omitting the empty cells in the table data while reading that line.

Below data got extracted when using extract_text()

Weekly test report with multiple lines of hedder of the each page of report
col1 col2 col3 Start End Col Group
Name Name Name Date Date Col5 Col6 Col7 Currency
123 ABC 26/8/2024 26/8/2024 1000 20000 26/8/2024 USD
456 DEF New 26/8/2024 2000 15000 27/8/2024 INR
789 GES DDD 26/8/2024 26/8/2023 4000 20/4/2024 AUD

I want to create a df using the table data that I have it in the pdf. any help would be aprreciated.


Solution

  • It seems there is no direct way I can access table data using extract_tables() since my table doesn't have outlines.

    To Read the table I need to use extract_table() with table_settings.

    import os
    import pandas as pd
    import pdfplumber
    
    
    pdf_file = 'D:/Input/Book1.pdf'
    
    page_vertical_lines = [40,102,152,203,253,305,360,408,458,510]
    table_settings={"vertical_strategy": "explicit", 
                    "horizontal_strategy": "text"
                    ,"explicit_vertical_lines": page_vertical_lines,
                    }
    pdf = pdfplumber.open(pdf_file)
    df1 = pd.DataFrame(pdf.pages[0].extract_table(table_settings))
    
    

    This code Extracted the required number of columns from the pdf file.

    Challenge here is getting explicit_vertical_lines position

    I have used matplotlib library to identity the correct postion with below code. you need to adjust the postion according to your report columns.

    import pdfplumber
    import matplotlib.pyplot as plt
    
    pdf_file = 'D:/Input/Book1.pdf'
    
    with pdfplumber.open(pdf_file) as pdf:
        page = pdf.pages[0]
        im = page.to_image()
        
        vertical_lines = [40,102,152,203,253,305,360,408,458,510]
        for x in vertical_lines:
            im.draw_vline(x,stroke="blue",stroke_width=1)
        
        plt.imshow(im.annotated)
        plt.show()
    
    

    when you run this code it gives blue lines as per vertical_lines values on the pdf as shown below in the picture. Adjust the vertical_lines values as per your need.

    PDF page with blue lines