Search code examples
pythonpandasdataframepdftabula-py

concat pdf tables into one excel table using python


I'm using tabula in order to concat all tables in the following pdf file

To be a one table within excel format.

Here's my code:

from tabula import read_pdf
import pandas as pd

allin = []
for page in range(1, 115):
    table = read_pdf("goal.pdf", pages=page,
                     pandas_options={'header': None})[0]
    allin.append(table)


new = pd.concat(allin)

new.to_excel("out.xlsx", index=False)

Also i tried the following as well:

from tabula import read_pdf
import pandas as pd

table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})

new = pd.concat(table, ignore_index=True)

new.to_excel("out.xlsx", index=False)

Current output: check

But the issue which am facing that from page# 91 i start to see the data not formatted correctly within the excel file.

I've debug the page individually and i couldn't figure out why it's formatted wrongly especially it's within same format.

from tabula import read_pdf
import pandas as pd

table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]


print(table)

enter image description here

Example:

from tabula import read_pdf
import pandas as pd

table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})

new = pd.concat(table, ignore_index=True)

new.to_excel("out.xlsx", index=False)

Here I've ran the code for two pages 90 and 91.

starting from row# 48 you will see the difference here

Where you will notice the issue that name and address placed into one cell. And city and state placed into one call as well


Solution

  • I digged in source code and it has option columns and you can manually define column boundaries. When you set columns then you have to use guess=False.

    tabula-py uses program tabula-java and in its documentation I found that it needs values in percents or points (not pixels). So I used program inkscape to measure boundaries in points.

    enter image description here

    from tabula import read_pdf
    import pandas as pd
    
    # display all columns in dataframe
    pd.set_option('display.width', None)
    
    columns = [210, 350, 420, 450]  # boundaries in points
    #columns = ['210,350,420,450']   # boundaries in points
    
    pages =  '90-92'
    #pages = [90,91,92]
    #pages = list(range(90,93))
    #pages = 'all'  # read all pages 
    
    tables = read_pdf("goal.pdf",
                      pages=pages,
                      pandas_options={'header': None},
                      columns=columns,
                      guess=False)
    
    df = pd.concat(tables).reset_index(drop=True)
    #df.rename(columns=df.iloc[0], inplace=True)  # convert first row to headers
    #df.drop(df.index[0], inplace=True)           # remove first row with headers 
    
    # display
    
    #for x in range(0, len(df), 20):
    #    print(df.iloc[x:x+20])
    #    print('----------')
    
    print(df.iloc[45:50])
    
    #df.to_csv('output-pdf.csv')
    
    #print(df[ df['State'].str.contains(' ') ])
    #print(df[ df.iloc[:,3].str.contains(' ') ])
    

    Result:

                                          0                         1       2   3               4
    45                        JARRARD, GARY      930 FORT WORTH DRIVE  DENTON  TX  (940) 565-6548
    46                        JARRARD, GARY        2219 COLORADO BLVD  DENTON  TX  (940) 380-1661
    47  MASON HARRISON, RATLIFF ENTERPRISES  1815 W. UNIVERSITY DRIVE  DENTON  TX  (940) 387-5431
    48  MASON HARRISON, RATLIFF ENTERPRISES          109 N. LOOP #288  DENTON  TX  (940) 484-2904
    49  MASON HARRISON, RATLIFF ENTERPRISES      930 FORT WORTH DRIVE  DENTON  TX  (940) 565-6548
    

    EDIT:

    It may need also option area (also in points) to skip headers. Or you will have to remove first row on first page.

    I didn't check all rows but it may need some changes in column boundaries.


    EDIT:

    Few rows make problem - probably because text in City is too long.

    col3 = df.iloc[:,3]
    
    print(df[ col3.str.contains(' ') ])
    

    Result:

                                  0                       1                 2         3                 4
    1941  UMSTATTD RESTAURANTS, LLC  120 WEST US HIGHWAY 54  EL DORADO SPRING      MS O    (417) 876-5755
    2079               SIMONS, GARY         1412 BURLINGTON  NORTH KANSAS CIT      MY O    (816) 421-5941
    2763       GRISHAM, ROBERT (RB)   403 WEST COURT STREET    WASHINGTON COU  ORTH HOU  S(E740) 335-7830
    2764            STAUFFER, JACOB   403 WEST COURT STREET    WASHINGTON COU  ORTH HOU  S(E740) 335-7830