How to make page range in tabula-py?


In Python 3, I have a PDF file "Ativos_Fevereiro_2018_servidores.pdf" with 6,041 pages. I'm on a machine with Ubuntu. The file is here: https://drive.google.com/file/d/1P8kF0gUOVls6sOGed4R0C2PlVF5RFtU6/view?usp=sharing

Each page has two lines of text at the top, followed by a table with a header and two columns. Each table has 36 rows, fewer on the last page.

At the bottom of each page, after the table, there is also a line of text.

I want to create a CSV from this PDF using only the tables and ignoring the text before and after them.

To avoid Java memory errors, I thought I'd split the file into groups of 300 pages. This is what I tried with tabula-py:

import tabula
import pandas as pd


dfs = []

for i in range(1,6041, 300):
    if i != 1:
        i = i + 1

    i2 = i + 300

    if i2 > 6041:
        i2 = 6041

    print(i)
    print(i2)

    try:
        df = tabula.read_pdf("Ativos_Fevereiro_2018.pdf", encoding='latin-1', spreadsheet=True, pages='i-i2', header=0)
        dfs.append(df)
        print('Page ', len(df), ' parsed.')
    except:
        print('Error on page: ', i)

output = pd.concat(dfs)
output.to_csv('servidores_rj_ativos_fev_18.csv', encoding='utf-8', index=False)

But the range I made is wrong:

1
301
Error: Syntax error in page range specification
Error on page:  1
302
602
...
Error: Syntax error in page range specification
Error on page:  5702
6002
6041
Error: Syntax error in page range specification
Error on page:  6002
Traceback (most recent call last):
  File "roboseguranca_pdftocsv.py", line 26, in <module>
    output = pd.concat(dfs)
  File "/home/reinaldo/Documentos/Code/intercept/seguranca/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 212, in concat
    copy=copy)
  File "/home/reinaldo/Documentos/Code/intercept/seguranca/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 245, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

Please, how can I correct the range error?


Solution

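  • For the range to work you have to pass the actual page numbers as a string. In your code, pages='i-i2' sends the literal text "i-i2" to tabula instead of the values of i and i2, which is why it reports a syntax error in the page range specification. Convert the integers to strings and join them with '-':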

    pages=(str(i)+'-'+str(i2))
    

    Some other things:

    • Also use encoding='utf-8' in the tabula.read_pdf call.
    • If you also want to see which error is thrown, extend the except statement, e.g.:

    except Exception as e: print('Error in range', i, '-', i2, ':', e)
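
Putting these points together, here is a minimal sketch of the chunked loop, not a definitive implementation. The helper names page_ranges and read_in_chunks are hypothetical and not part of tabula-py. The sketch also builds non-overlapping chunks (1-300, 301-600, ...), because the loop in the question would parse boundary pages such as 602 twice (302-602 and then 602-902). It assumes that tabula.read_pdf may return either a single DataFrame or a list of DataFrames depending on the installed version, and it forwards any extra options untouched rather than guessing which ones your version accepts.

```python
def page_ranges(total_pages, chunk_size):
    """Yield non-overlapping page-range strings: '1-300', '301-600', ..."""
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        yield f"{start}-{end}"


def read_in_chunks(pdf_path, total_pages, chunk_size=300, **read_kwargs):
    """Hypothetical helper: parse pdf_path chunk by chunk into one DataFrame.

    read_kwargs are forwarded unchanged to tabula.read_pdf, so pass whatever
    options your installed tabula-py version accepts (e.g. encoding).
    """
    # Imported lazily so page_ranges() can be tried without tabula-py or Java.
    import pandas as pd
    import tabula

    frames = []
    for pages in page_ranges(total_pages, chunk_size):
        try:
            result = tabula.read_pdf(pdf_path, pages=pages, **read_kwargs)
        except Exception as e:
            # Report the failing range and carry on with the next chunk.
            print("Error in range", pages, ":", e)
            continue
        # Assumption: the return type differs between tabula-py versions,
        # so accept both a list of DataFrames and a single DataFrame.
        if isinstance(result, list):
            frames.extend(result)
        elif result is not None:
            frames.append(result)

    # pd.concat raises "No objects to concatenate" on an empty list,
    # so fall back to an empty DataFrame when every chunk failed.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

For the 6,041-page file, page_ranges(6041, 300) produces 21 ranges, starting with '1-300' and ending with '6001-6041'. A call such as read_in_chunks("Ativos_Fevereiro_2018.pdf", 6041, encoding="utf-8") would then return the combined table, which can be written out with to_csv as in the question.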