
Extracting data from PDFs into CSV


I would like to extract all the data, available on pages 4 to 605 of this PDF, into a CSV file. Some people kindly suggested that I use pypdf.

I don't know how to use it. The structure of the PDF is complicated: sometimes some columns are empty, and other values are not standardized (e.g. in the "VALOR" column, the same number can be written "1450", "1.450", "1 4 5 0", etc.).

I hope you can help me! Thank you in advance for your help.

P.S.: I'm still very inexperienced in Python, so please excuse any errors in terminology and/or anything else.


Solution

  • What you need

    Indeed, pypdf is good at extracting text from PDFs. From its documentation:

    pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

    But in your case, tabula is what you need:

    tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.

    Installation:

    You can install tabula using pip:

    pip install tabula-py
    

    For faster execution, install it with jpype:

    pip install tabula-py[jpype]
    

    How to use tabula.read_pdf:

    tabula.read_pdf is the function we are going to use to extract tables from the PDF file. We are interested in the following parameters:

    • input_path: the input PDF file
    • pages: the pages to extract from ("4-605" extracts all tables from pages 4 to 605)
    • pandas_options: options passed to pandas
    • area: to be more efficient, we specify the area of the page that contains the table, in the format (top, left, bottom, right)
    • columns: list of X coordinates of the column boundaries

    Let's do some measurements!

    Measurement, the fun part!

    The picture is a screenshot of page 5 of the PDF; all other pages are similar to this page (except page 4).

    What we are interested in is the green area that contains the rows we need, in this case:

    area=[145, 70, 777, 525]
    

    For page 4, the top is different:

    area_p4=[250, 70, 777, 525]
    


    We also need the X coordinates of the column boundaries to avoid misplacing column values:

    X=[120, 315, 365, 383, 405, 419, 444, 464, 485, 500, 530]
    

    Note: the red region should be part of the green area, but tabula has difficulty handling it, as multi-line column names are treated as separate rows.

    We can set the names using pandas options:

    column_names = [
        "MARCA", "MODELO -TIPO", "PERIODOCOMERCIAL",
        "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf",
        "CO2 g/km", "cv", "2017 VALOR Euros"
    ]
    
    pandas_options = {
        "header": 0,
        "names":column_names
    }
    

    You can name the columns in a better way!
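
    To see what these options do, here is a small standalone sketch using pandas.read_csv, which accepts the same header/names arguments that tabula forwards to pandas (the sample data is made up):

    import io
    import pandas as pd
    
    # The PDF's multi-line headers come out garbled, so we treat the first
    # extracted row as a header to discard ("header": 0) and supply our own
    # column names via "names".
    raw = io.StringIO("bad1,bad2\nSEAT,12000\nFIAT,9500\n")
    df = pd.read_csv(raw, header=0, names=["MARCA", "2017 VALOR Euros"])
    print(df.columns.tolist())  # ['MARCA', '2017 VALOR Euros']
    

    The garbled header row is dropped and the data rows keep our clean names.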

    Coding, the easy part!

    Most of the work is done; all that remains is to extract the tables from the PDF:

    from tabula import read_pdf
    
    with open("file.pdf", "rb") as file:
        # extract tables from pages 5 to 605
        dfs = read_pdf(
            file,
            pages="5-605",  # pages to extract tables from
            area=area,
            columns=X,
            pandas_options=pandas_options,
        )
    
        # rewind the file object before reading it a second time
        file.seek(0)
    
        # extract the table on page 4 (its top coordinate is different)
        df_4 = read_pdf(
            file,
            pages=4,
            area=area_p4,
            columns=X,
            pandas_options=pandas_options,
        )
    

    Then we need to concatenate all the DataFrames into one using pd.concat:

    # add page 4 table to the other tables
    dfs.extend(df_4)
    
    # concat all dfs to get one table
    data = pd.concat(
        dfs,
        axis=0,
        ignore_index=True
    )
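
    This works because read_pdf returns a list of DataFrames (roughly one per detected table), so extend merges the two lists and pd.concat stacks them. A toy illustration of the same pattern, with made-up one-row frames standing in for pages:

    import pandas as pd
    
    # Small frames standing in for tables extracted from different pages
    dfs = [pd.DataFrame({"MARCA": ["SEAT"]}), pd.DataFrame({"MARCA": ["FIAT"]})]
    df_4 = [pd.DataFrame({"MARCA": ["OPEL"]})]
    
    # add page 4's table to the others, then stack them into one frame
    dfs.extend(df_4)
    data = pd.concat(dfs, axis=0, ignore_index=True)
    print(data["MARCA"].tolist())  # ['SEAT', 'FIAT', 'OPEL']
    

    ignore_index=True gives the combined table a fresh 0..n-1 index instead of repeating each page's row numbers.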
    

    Now we can check some sample output:

    data.sample(5)
    
    | index | MARCA | MODELO -TIPO | PERIODOCOMERCIAL | C.C. | Nº de cilind. | G/D | P kW | cvf | CO2 g/km | cv | 2017 VALOR Euros |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | 193 | ALFA ROMEO | ALFA 159 1.9 JTD ti 150 | 2008-2012 | 1910.0 | 4.0 | D | 110 | 12,96 | 157.0 | 150.0 | 27300.0 |
    | 172 | ALFA ROMEO | ALFA 159 1.9 JTS Selective | 2008-2012 | 1859.0 | 4.0 | G | 118 | 12,75 | 205.0 | 161.0 | 24300.0 |
    | 200 | ALFA ROMEO | ALFA 159 2.0 JTDM Elegante ECO | 2008-2012 | 1956.0 | 4.0 | D | 125 | 13,14 | 136.0 | 170.0 | 24000.0 |
    | 164 | ALFA ROMEO | ALFA 156 Sportwagon 2.5 V6 Distinctive Q-System | 2003-2006 | 2492.0 | 6.0 | G | 141 | 17,88 | NaN | 192.0 | 25100.0 |
    | 102 | ALFA ROMEO | ALFA 156 1.6 TS | 1997-2002 | 1598.0 | 4.0 | G | 88 | 11,64 | NaN | 120.0 | 13500.0 |
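
    The question also mentioned that values in the VALOR column are not homogeneous ("1450", "1.450", "1 4 5 0"). One way to normalize them, and finally write the CSV the question asked for, is a sketch like this; the tiny sample frame is made up, and it assumes "." is a thousands separator rather than a decimal point:

    import pandas as pd
    
    # Made-up sample standing in for the extracted table
    data = pd.DataFrame({"2017 VALOR Euros": ["1450", "1.450", "1 4 5 0"]})
    
    col = "2017 VALOR Euros"
    # Strip everything that is not a digit, then convert to integers
    data[col] = (
        data[col]
        .astype(str)
        .str.replace(r"\D", "", regex=True)
        .astype(int)
    )
    print(data[col].tolist())  # [1450, 1450, 1450]
    
    # Write the cleaned table to CSV
    data.to_csv("output.csv", index=False)
    

    All three spellings collapse to the same integer before the table is written out.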

    Finally, you can find the full code in this colab notebook

    Edit:

    Tabula is also available as a friendly web app designed to run locally on your machine.

    You can select tables manually or automatically; it has a feature to auto-detect tables using the Hough transform:

    Next, we detect the boundaries of the table rows. If the table contains ruling lines to separate rows, we use their position to generate the boundaries (top and bottom) of each row. The lines are detected with a computer vision technique called Hough transform, as implemented in the OpenCV library.

    You can use it to extract the data, or to get the table areas:


    You can export the template as JSON; here is the output for page 4:

    [{"page":4,"extraction_method":"guess","x1":75.15359512329101,"x2":522.3546908569336,"y1":250.3879679107666,"y2":782.4158937835693,"width":447.2010957336426,"height":532.0279258728027}]