When reading a PDF file, df = tabula.read_pdf(pdf_file, pages='all') displays all tables from all pages,
but when converting into a pandas DataFrame using tables = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all', lattice=True)[0]), only the table on the first page is displayed.
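The result that you receive from tabula should be a list of DataFrames, one per detected table, rather than a single DataFrame. Indexing that list with [0] keeps only the first table, which is why you only see the table from the first page.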
I also think that if you want to use pandas and tabula together, the syntax should be something like the line below (note that [0] still selects only the first table):
df = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all')[0])
If you want to use every table that tabula returned, you can concatenate the list into a single DataFrame:
dfs = tabula.read_pdf(pdf_file, pages='all')
df = pd.concat(dfs)
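
To make the difference between taking [0] and concatenating concrete, here is a minimal sketch. It assumes tabula returned a list of two DataFrames; the hand-built frames and the names first_only and combined are stand-ins for illustration, not tabula output. Passing ignore_index=True is an optional addition that renumbers the rows from 0.

```python
import pandas as pd

# Hypothetical stand-ins for the list that
# tabula.read_pdf(pdf_file, pages='all') is expected to return.
dfs = [
    pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]}),  # table from page 1
    pd.DataFrame({"item": ["c"], "qty": [3]}),          # table from page 2
]

# Indexing with [0] keeps only the first table.
first_only = pd.DataFrame(dfs[0])

# Concatenating keeps the rows of every table; ignore_index=True
# replaces the per-table row labels with one continuous index.
combined = pd.concat(dfs, ignore_index=True)

print(len(first_only), len(combined))
```

Without ignore_index=True, each table keeps its own row labels, so the combined index would contain duplicates.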
If every table has its own header and you want to keep only the first table's header, stack the row values and reuse the first table's column names:
import numpy as np
dfs = tabula.read_pdf(pdf_file, pages='all')  # list of DataFrames, one per table
# np.concatenate stacks only the row values, so every table must have the same number of columns
df = pd.DataFrame(np.concatenate(dfs), columns=dfs[0].columns)
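
The reason for going through NumPy here is that pd.concat aligns tables by column label, while stacking the raw values aligns them by position. The sketch below illustrates this with two hand-built frames whose headers differ only in capitalization. The frames and the names by_label and by_position are assumptions for illustration, and to_numpy() is called explicitly to make the conversion visible.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for tables read from two pages,
# where the second table's header is spelled differently.
page1 = pd.DataFrame({"Name": ["Ann", "Bob"], "Score": [90, 85]})
page2 = pd.DataFrame({"name": ["Cid"], "score": [70]})
dfs = [page1, page2]

# Label-based: mismatched headers become separate columns padded with NaN.
by_label = pd.concat(dfs, ignore_index=True)

# Position-based: stack the values and apply the first table's header.
values = np.concatenate([d.to_numpy() for d in dfs])
by_position = pd.DataFrame(values, columns=dfs[0].columns)

print(by_label.shape, by_position.shape)
```

Because the values pass through a mixed-type NumPy array, the resulting columns have object dtype; calling by_position.infer_objects() afterwards is one way to recover numeric types.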