How to extract a table from a PDF in Python?


I have thousands of PDF files, consisting only of tables, with this structure:

[image: pdf file]

Although the files are fairly structured, I cannot read the tables without losing that structure.

I tried PyPDF2, but the extracted text comes out completely jumbled.

import PyPDF2

# Open the file in binary mode and read the first page
pdfFileObj = open('pdf_file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)

print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])

I also tried Tabula, but it reads only the headers, not the content of the tables.

from tabula import read_pdf

pdfFile1 = read_pdf('pdf_file.pdf', output_format='json')  # Option 1: reads all the headers
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables=True)  # Option 2: reads only the first header and a few lines of content

Any thoughts?


Solution

  • After some trial and error, I found a way.

    For each page of the file, I had to pass tabula's read_pdf function both the area of the table and the limits of the columns.

    Here is the working code:

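    import pypdf
    from tabula import read_pdf

    pdf_file = "pdf_file.pdf"  # path to the PDF to read

    # Get the number of pages in the file
    pdf_reader = pypdf.PdfReader(pdf_file)
    n_pages = len(pdf_reader.pages)

    # For each page, the table can be read with the following call
    # (shown here for page 1; pages are numbered from 1 to n_pages)
    table_pdf = read_pdf(
        pdf_file,
        guess=False,  # do not auto-detect the table area
        pages=1,
        stream=True,
        encoding="utf-8",
        # Table area as (top, left, bottom, right), in points
        area=(96, 24, 558, 750),
        # X coordinates of the column boundaries
        columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
    )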
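
The call above reads a single page. A rough sketch of how the loop over all pages might look is given here, assuming every page shares the same table area and column limits (the answer does not say whether they do). The helper name `read_all_pages` and its `read_page` parameter are hypothetical, introduced only for illustration; `read_page` defaults to tabula's `read_pdf`.

```python
def read_all_pages(pdf_file, n_pages, area, columns, read_page=None):
    """Read the tables from every page, reusing one area and one set of column limits.

    ASSUMPTION: all pages have the same layout. If they differ, area and
    columns would have to be looked up per page instead.
    read_page is a hypothetical hook; by default it is tabula.read_pdf.
    """
    if read_page is None:
        # Imported lazily so the helper can be defined without tabula installed
        from tabula import read_pdf as read_page

    tables = []
    for page in range(1, n_pages + 1):  # tabula numbers pages from 1
        page_tables = read_page(
            pdf_file,
            guess=False,
            pages=page,
            stream=True,
            encoding="utf-8",
            area=area,
            columns=columns,
        )
        # read_pdf returns a list of DataFrames for the requested page
        tables.extend(page_tables)
    return tables
```

With the names from the answer, a call such as `read_all_pages(pdf_file, n_pages, area=(96, 24, 558, 750), columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748))` would collect the tables from all pages into a single list.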