python pandas pdf pypdf tabula

Converting PDF document to DataFrame


I have a PDF document with 388 pages and one table per page. I am trying to convert them to Excel or to multiple DataFrames, but I am having some difficulties: I have tried the PyPDF2 and tabula libraries, but extraction stops after only one page.

All pages have the same layout, but with a different industry name and different numbers.

So far, the best results I have got are with:

import tabula
import pandas as pd

df= pd.DataFrame()
df = tabula.read_pdf("FSA.pdf",multiple_tables=True)

tabula.convert_into("FSA.pdf", "fsa_report.csv", output_format="csv",multiple_tables=True)
print(df)

But it stops after completing page 1. Any help?


Solution

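  • file = "FSA.pdf"
    # pages defaults to 1; pass "all" (or a range such as "1-388") to read every page
    df = tabula.read_pdf(file, lattice=True, pages="all", multiple_tables=True)
    tabula.convert_into(file, "fsa_report.csv", output_format="csv", pages="all", multiple_tables=True)
    

    Use these lines. You need to specify which pages to extract: tabula reads only page 1 by default, so pass pages="all" (or a page number, list, or range) to process the rest of the document.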
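
Since the question asks for either multiple DataFrames or a single export, here is a minimal sketch of both, assuming tabula-py and a Java runtime are installed. The helper names read_all_pages and combine_tables and the table_no column are hypothetical, made up for illustration; lattice=True is carried over from the answer and may need to be dropped if the tables have no ruling lines.

```python
import pandas as pd


def read_all_pages(pdf_path):
    """Return a list with one DataFrame per table found in the PDF."""
    import tabula  # tabula-py; imported here because it needs Java at call time

    # pages="all" overrides the default of reading page 1 only.
    return tabula.read_pdf(pdf_path, pages="all", lattice=True, multiple_tables=True)


def combine_tables(tables):
    """Stack the extracted tables into one DataFrame.

    Each row is tagged with the 1-based position of the table it came from
    (hypothetical column name: table_no), so rows can still be traced back.
    """
    tagged = [t.assign(table_no=i) for i, t in enumerate(tables, start=1)]
    return pd.concat(tagged, ignore_index=True)
```

With the file from the question, tables = read_all_pages("FSA.pdf") would give the list of per-table DataFrames, and combine_tables(tables).to_csv("fsa_combined.csv", index=False) would write them to a single file (the output name is an assumption). Stacking only makes sense if every page really shares the same columns, as the question states.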