python pdf tabula tabula-py

Data missing while reading a PDF file using tabula and Python


I have a PDF containing text and several tables. One row looks like this:

PDF content :
Id: 5647484848 Name Alex J

I am using tabula-py to parse the content, but the result is incomplete: the first character or digit of some values is missing.

My original PDF has a lot of text and tables. I tried the same approach on other rows, and there I get exactly the right result.

Wrong Result :
['', '', 'Id:', '', '647484848', 'Name', '', 'lex J', '', '', '']

Should be :
['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']

Sample :

# Excerpt from inside the row-parsing function.
# Rows with 11 cells hold the student record; index 7 is the Name value.
if len(row) == 11:
    if "Name" in row:
        print(row[7])
        return Student(studentname=row[7])
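
The fixed index above only works while the row keeps exactly the same shape. As a minimal sketch (the helper name `value_after_label` is hypothetical, not part of tabula-py), the value could instead be looked up relative to its label. This does not recover the missing first character; it only makes the lookup less dependent on the cell count.

```python
from typing import Optional, Sequence


def value_after_label(row: Sequence[str], label: str) -> Optional[str]:
    """Return the first non-empty cell that follows `label`, or None."""
    try:
        start = row.index(label) + 1
    except ValueError:
        return None  # the label is not present in this row
    for cell in row[start:]:
        if cell.strip():
            return cell.strip()
    return None  # the label was the last non-empty cell


# The expected row from the question.
row = ['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']
name = value_after_label(row, "Name")
student_id = value_after_label(row, "Id:")
print(name, student_id)
```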

When reading the tables with tabula, I use these settings:

df = tabula.read_pdf(pdf, output_format='json', pages='all',
                     password=secure_password, lattice=True)
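
With output_format='json', read_pdf returns a list of table dictionaries rather than a DataFrame. Each table has a "data" key holding rows, and each row is a list of cell dictionaries with a "text" key. A rough sketch of flattening that into lists of strings like the ones shown above (the helper name `iter_text_rows` is hypothetical, and the sample is hand-built with the coordinate keys left out):

```python
from typing import Any, Dict, Iterator, List


def iter_text_rows(tables: List[Dict[str, Any]]) -> Iterator[List[str]]:
    """Yield each table row as a plain list of cell texts."""
    for table in tables:
        for cells in table.get("data", []):
            yield [cell.get("text", "") for cell in cells]


# Hand-built stand-in for the JSON output; real output also carries
# position keys such as "top", "left", "width" and "height".
sample = [
    {
        "extraction_method": "lattice",
        "data": [
            [{"text": ""}, {"text": "Id:"}, {"text": "5647484848"},
             {"text": "Name"}, {"text": "Alex J"}],
        ],
    }
]

rows = list(iter_text_rows(sample))
print(rows)
```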

The row is plain text, with no images. I don't know why extraction fails for this particular row; I have applied similar logic to other rows and got the correct result. Any suggestions?


Solution

  • Solved by changing the extraction mode in tabula-py from lattice=True to lattice=False:

    df = tabula.read_pdf(pdf, output_format='json', pages='all',
                         password=secure_password, lattice=False)
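
Lattice mode is meant for tables whose cells are separated by ruling lines, so it can help to check whether other rows were affected in the same way. The sketch below is one possible check, assuming the cell texts from a lattice=True run and a lattice=False run have already been flattened into lists of strings. The helper name `find_clipped_cells` is hypothetical, and the heuristic can give false positives (for example "1" against "11").

```python
from typing import Iterable, List, Tuple


def find_clipped_cells(
    lattice_cells: Iterable[str], other_cells: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair each lattice cell with a cell from the other run that equals
    it plus exactly one extra leading character."""
    others = {c.strip() for c in other_cells if c.strip()}
    suspects = []
    for cell in lattice_cells:
        text = cell.strip()
        if not text:
            continue  # empty cells carry no information
        for full in others:
            if len(full) == len(text) + 1 and full.endswith(text):
                suspects.append((text, full))
    return sorted(suspects)


# The two rows from the question stand in for the two extraction runs.
lattice_row = ['', '', 'Id:', '', '647484848', 'Name', '', 'lex J', '', '', '']
other_row = ['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']

clipped = find_clipped_cells(lattice_row, other_row)
print(clipped)
```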