python text-extraction tabula-py

extracting all tables using tabula


When I read a PDF file using df = tabula.read_pdf(pdf_file, pages='all'), all tables from all pages are displayed.

But when I convert the result into a pandas DataFrame using tables = pd.DataFrame(pdf_file, pages = 'all', lattice = 'True')[0]), only the table on the first page is displayed.


Solution

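  • In recent tabula-py versions, tabula.read_pdf returns a list of DataFrames (one per detected table), so indexing the result with [0] selects only the first table.

    I also think that if you want to use pandas and tabula together, the syntax should be something like the following (note that [0] still picks only the first table):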

    df = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all')[0])
    

    If you want to use everything tabula returned, you can concatenate the list into a single DataFrame:

    dfs = tabula.read_pdf(pdf_file, pages='all')
    df = pd.concat(dfs)
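
To make the difference between indexing with [0] and concatenating concrete, here is a minimal sketch. It does not read a PDF: the two hand-built DataFrames are stand-ins for the list that tabula.read_pdf would return, and their column names and values are made up for illustration. Passing ignore_index=True is optional; it just renumbers the rows of the combined table.

```python
import pandas as pd

# Stand-ins for the list returned by tabula.read_pdf(pdf_file, pages='all'):
# one DataFrame per detected table (hypothetical data, no PDF involved).
dfs = [
    pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}),  # table on page 1
    pd.DataFrame({"id": [3], "name": ["c"]}),          # table on page 2
]

# Indexing with [0] keeps only the first table.
first_only = dfs[0]

# Concatenating keeps the rows of every table; ignore_index=True
# replaces the per-table indexes with one continuous 0..n-1 index.
combined = pd.concat(dfs, ignore_index=True)

print(len(first_only), len(combined))
```

If the tables do not share the same column names, pd.concat aligns by name and fills the gaps with NaN, so it is worth checking the columns of each table first.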
    

    If every table has its own header and you want to keep only the first table's header for the combined result, try the following:

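    import numpy as np
    import pandas as pd
    import tabula

    # read_pdf returns a list of DataFrames, one per detected table
    dfs = tabula.read_pdf(pdffile, pages='all')
    # stack the raw values and reuse the first table's header for every row
    df = pd.DataFrame(np.concatenate(dfs), columns=dfs[0].columns)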
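
A pandas-only variant of the same idea is sketched below, in case it helps. It relabels every table with the first table's header by position and then concatenates, which keeps per-column dtypes (going through np.concatenate can turn mixed-type tables into object columns). The function name stack_tables and the sample data are hypothetical, and the sketch assumes every table has the same number of columns in the same order.

```python
import pandas as pd

def stack_tables(tables):
    """Stack tables by position, keeping only the first table's header."""
    header = tables[0].columns
    aligned = []
    for table in tables:
        table = table.copy()     # leave the caller's DataFrames untouched
        table.columns = header   # raises ValueError if the column count differs
        aligned.append(table)
    return pd.concat(aligned, ignore_index=True)

# Hypothetical stand-ins for tabula output where the second table's
# header was extracted slightly differently from the first.
t1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
t2 = pd.DataFrame({"ID ": [3], "Name": ["c"]})

result = stack_tables([t1, t2])
print(result)
```

With real tabula output you would pass the list from tabula.read_pdf(pdffile, pages='all') instead of the hand-built tables.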