
pdfplumber extract table data works when the table has borders, doesn't work when the table has no borders


Using reportlab, I made two one-page PDFs, each containing one table.

The data in the table is this:

data1 = [['00', '', '02', '', '04'],
         ['', '11', '', '13', ''],
         ['20', '', '22', '23', '24'],
         ['30', '31', '32', '', '34']]

The goal is to get every row, including the empty cells. If the table has borders, this works without problems.

But if the table has no borders, the code below returns no tables at all.

Any ideas why?

The two PDFs are identical except that pdf1 has no borders and pdf2 has borders.

import pdfplumber

# path2pdf (directory) and savename1 (file name of the borderless PDF) are defined elsewhere
with pdfplumber.open(path2pdf + savename1) as pdf1:
    # Get the first page of the document
    page = pdf1.pages[0]
    # Get the text of the page
    text = page.extract_text()
    # Get all tables found on this page
    tables = page.extract_tables()
    # Print every row of every table
    for table in tables:
        for row in table:
            print(row)

If I open pdf2 instead of pdf1, I get the required result.

EDIT: I tried this, but got an error. I'm not sure how the call should be formatted:

pdf_table = page.extract_tables(vertical_strategy='text', horizontal_strategy='text')

Traceback (most recent call last):
  File "/usr/lib/python3.8/idlelib/run.py", line 559, in runcode
    exec(code, self.locals)
  File "<pyshell#70>", line 1, in <module>
TypeError: extract_tables() got an unexpected keyword argument 'vertical_strategy'


Solution

  • According to the pdfplumber documentation, page.extract_tables() accepts a set of table extraction settings.

    By default, it uses the page's vertical and horizontal lines as cell separators, but you can specify an alternative extraction strategy.

    The method can be customised with the following settings (shown with their default values):

    {
        "vertical_strategy": "lines", 
        "horizontal_strategy": "lines",
        "explicit_vertical_lines": [],
        "explicit_horizontal_lines": [],
        "snap_tolerance": 3,
        "snap_x_tolerance": 3,
        "snap_y_tolerance": 3,
        "join_tolerance": 3,
        "join_x_tolerance": 3,
        "join_y_tolerance": 3,
        "edge_min_length": 3,
        "min_words_vertical": 3,
        "min_words_horizontal": 1,
        "keep_blank_chars": False,
        "text_tolerance": 3,
        "text_x_tolerance": 3,
        "text_y_tolerance": 3,
        "intersection_tolerance": 3,
        "intersection_x_tolerance": 3,
        "intersection_y_tolerance": 3,
    }
    

    The settings most relevant here are vertical_strategy and horizontal_strategy:

    • vertical_strategy options: "lines", "lines_strict", "text", or "explicit".
    • horizontal_strategy options: "lines", "lines_strict", "text", or "explicit".
    • min_words_vertical - When using "vertical_strategy": "text", at least min_words_vertical words must share the same alignment.
    • min_words_horizontal - When using "horizontal_strategy": "text", at least min_words_horizontal words must share the same alignment.

    The "text" strategy:

    For vertical_strategy: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For horizontal_strategy, the same but using the tops of words.
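To make the alignment rule concrete, here is a deliberately simplified sketch of the idea behind the "text" strategy. It is not pdfplumber's actual implementation: it only looks at left edges (pdfplumber also considers right edges and centers), and the x-coordinates in column_x are made-up values standing in for where reportlab might have placed each column.

```python
def deduce_vertical_lines(words, min_words_vertical=3, tolerance=3):
    """Return x-positions shared by at least `min_words_vertical` left edges."""
    # Sort the left edges so that nearly equal values end up adjacent
    lefts = sorted(word["x0"] for word in words)
    clusters = []
    for x in lefts:
        # Join the current cluster if x is within `tolerance` of its first edge
        if clusters and x - clusters[-1][0] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # Keep clusters with enough aligned words and report their mean position
    return [sum(c) / len(c) for c in clusters if len(c) >= min_words_vertical]


data1 = [['00', '', '02', '', '04'],
         ['', '11', '', '13', ''],
         ['20', '', '22', '23', '24'],
         ['30', '31', '32', '', '34']]

# Hypothetical left edge of each column, in PDF points
column_x = [72, 120, 168, 216, 264]

# Empty cells produce no word on the page, so they are skipped here
words = [{"text": cell, "x0": column_x[col]}
         for row in data1
         for col, cell in enumerate(row) if cell]

print(deduce_vertical_lines(words))                        # [72.0, 168.0, 264.0]
print(deduce_vertical_lines(words, min_words_vertical=2))  # [72.0, 120.0, 168.0, 216.0, 264.0]
```

In this toy model the second and fourth columns hold only two words each, so they are missed with the default threshold of 3. If a sparse table comes back with merged or missing columns, lowering min_words_vertical may be worth trying.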

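    Pass the settings as a single dictionary through the table_settings argument, not as separate keyword arguments (which is what raises the TypeError in the question). page.extract_tables() accepts the same argument:

    .extract_table(table_settings={<put settings you need in here>})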
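As a minimal sketch of that call for the borderless PDF, assuming the text-based strategies are what the table needs: the helper name extract_borderless_table and the file name in the usage comment are illustrative, not part of pdfplumber.

```python
# Only the two strategy keys are overridden; the other settings keep their defaults
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_borderless_table(page, settings=TEXT_SETTINGS):
    """Extract one table from a pdfplumber page using text alignment, not ruled lines."""
    # The settings are passed as ONE dict through table_settings
    return page.extract_table(table_settings=settings)


# Usage (requires pdfplumber; "borderless.pdf" is a placeholder file name):
#
#     import pdfplumber
#     with pdfplumber.open("borderless.pdf") as pdf:
#         table = extract_borderless_table(pdf.pages[0])
#         for row in table or []:   # extract_table returns None when nothing is found
#             print(row)
```

The helper takes an already opened page so it can be reused for every page of a document.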

    Note:

    It is often helpful to crop the page with Page.crop(bounding_box) before trying to extract the table.
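Cropping first could look roughly like this. The bounding box is (x0, top, x1, bottom) in page coordinates; the helper name and the numbers in the usage comment are placeholders that would need to be adjusted to the actual position of the table.

```python
TEXT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_table_in_region(page, bounding_box, settings=None):
    """Crop `page` to `bounding_box`, then extract one table from the cropped area."""
    # Restrict extraction to the region so stray text elsewhere cannot affect alignment
    cropped = page.crop(bounding_box)
    # Fall back to the text-based strategies when no settings are given
    return cropped.extract_table(table_settings=settings or TEXT_SETTINGS)


# Usage (placeholder coordinates):
#
#     table = extract_table_in_region(page, (50, 100, 400, 300))
```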