I would like to extract all the data from pages 4 to 605 of this PDF into a CSV file. Some people kindly suggested that I use pypdf.
I don't know how to use it. The structure of the PDF is complicated: sometimes columns are empty, and other values are not homogenized (e.g. in the column "VALOR", numbers can be written as "1450", "1.450", "1 4 5 0", etc.).
I hope you can help me! Thank you in advance for your help.
P.S.: I'm still very inexperienced in Python, so please excuse any errors in terminology and/or anything else.
Indeed, pypdf is good at extracting text from PDFs. From the documentation:
pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.
But in your case, tabula-py is what you need:
tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.
You can install tabula-py using pip (note that tabula-py wraps tabula-java, so it also needs a Java runtime installed):
pip install tabula-py
For faster execution, install it with jpype:
pip install tabula-py[jpype]
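As the documentation quoted above notes, tabula-py can also convert a PDF straight into a CSV in a single call. Here is a minimal sketch of that approach (the file names are just placeholders):

```python
import tabula

# convert every page of the PDF directly into one CSV file
tabula.convert_into("file.pdf", "output.csv", output_format="csv", pages="all")
```

That works for simple layouts, but this PDF needs finer control over the table areas and columns, so we will use tabula.read_pdf instead.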
tabula.read_pdf is the function we are going to use to extract tables from the PDF file. We are interested in the following parameters:

- pages: the pages to extract tables from (an int, a list, or a range string such as "5-605")
- area: the portion of the page to analyze, given as [top, left, bottom, right] in points
- columns: the X coordinates of the column boundaries
- pandas_options: a dict of options passed to pandas when building the DataFrames
Let's do some measurements!
The picture is a screenshot of page 5 of the PDF; all the other pages are similar to this one (except page 4).
What we are interested in is the green area that contains the rows we need; in this case:
area=[145, 70, 777, 525]
For page 4 the top is different:
area_p4=[250, 70, 777, 525]
We also need the X coordinates of the column boundaries to avoid misplacing column values:
X=[120, 315, 365, 383, 405, 419, 444, 464, 485, 500, 530]
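Before processing all six hundred pages, it can be worth testing these coordinates on a single page. A quick sketch of such a check ("file.pdf" stands for your PDF file):

```python
from tabula import read_pdf

# sanity check: extract only page 5 using the measured area and
# column boundaries, then inspect the first rows
test = read_pdf("file.pdf", pages=5, area=area, columns=X)
print(test[0].head())  # read_pdf returns a list of DataFrames
```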
tabula has difficulty with the header here, because multi-line column names are treated as separate rows. We can set the names ourselves using pandas options:
column_names = [
    "MARCA", "MODELO -TIPO", "PERIODOCOMERCIAL",
    "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf",
    "CO2 g/km", "cv", "2017 VALOR Euros"
]
pandas_options = {
    "header": 0,
    "names": column_names
}
You can name the columns in a better way if you like!
Most of the work is done; now we need to extract the tables from the PDF:
with open("file.pdf", 'rb') as file:
# extract table from pages
dfs = read_pdf(
file,
pages="5-605", # pages to extract table from
area=area,
columns=X,
pandas_options=pandas_options,
)
# extract table in page 4
df_4 = read_pdf(
file,
pages=4,
area=area_p4,
columns=X,
pandas_options=pandas_options,
)
Then we need to concatenate all the DataFrames into a single one using pd.concat:
# add the page 4 table to the list of other tables
dfs.extend(df_4)
# concatenate all DataFrames into one table
data = pd.concat(
    dfs,
    axis=0,
    ignore_index=True,
)
Now we can check a sample of the output:
data.sample(5)
| index | MARCA | MODELO -TIPO | PERIODOCOMERCIAL | C.C. | Nº de cilind. | G/D | P kW | cvf | CO2 g/km | cv | 2017 VALOR Euros |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 193 | ALFA ROMEO | ALFA 159 1.9 JTD ti 150 | 2008-2012 | 1910.0 | 4.0 | D | 110 | 12,96 | 157.0 | 150.0 | 27300.0 |
| 172 | ALFA ROMEO | ALFA 159 1.9 JTS Selective | 2008-2012 | 1859.0 | 4.0 | G | 118 | 12,75 | 205.0 | 161.0 | 24300.0 |
| 200 | ALFA ROMEO | ALFA 159 2.0 JTDM Elegante ECO | 2008-2012 | 1956.0 | 4.0 | D | 125 | 13,14 | 136.0 | 170.0 | 24000.0 |
| 164 | ALFA ROMEO | ALFA 156 Sportwagon 2.5 V6 Distinctive Q-System | 2003-2006 | 2492.0 | 6.0 | G | 141 | 17,88 | NaN | 192.0 | 25100.0 |
| 102 | ALFA ROMEO | ALFA 156 1.6 TS | 1997-2002 | 1598.0 | 4.0 | G | 88 | 11,64 | NaN | 120.0 | 13500.0 |
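Since you ultimately want a CSV, and you mentioned that the "VALOR" figures are written inconsistently ("1450", "1.450", "1 4 5 0", ...), you can clean that column and write the result to disk with pandas. This is only a sketch: it assumes "2017 VALOR Euros" is the affected column and that dots and spaces in those cells are purely visual separators.

```python
import pandas as pd

def clean_valor(value):
    """Normalize entries like "1450", "1.450" or "1 4 5 0" to 1450."""
    # values that tabula already parsed as numbers are kept as they are
    if isinstance(value, (int, float)):
        return value
    # assumption: dots and spaces are only thousands/visual separators
    digits = str(value).replace(".", "").replace(" ", "")
    return pd.to_numeric(digits, errors="coerce")

data["2017 VALOR Euros"] = data["2017 VALOR Euros"].map(clean_valor)

# finally, write the whole table to a single CSV file
data.to_csv("output.csv", index=False)
```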
Finally, you can find the full code in this Colab notebook.
Tabula is a user-friendly web app designed to run locally on your machine.
You can select tables manually or automatically; it has a feature to autodetect tables using the Hough transform:
Next, we detect the boundaries of the table rows. If the table contains ruling lines to separate rows, we use their position to generate the boundaries (top and bottom) of each row. The lines are detected with a computer vision technique called Hough transform, as implemented in the OpenCV library.
You can use it to extract the data directly, or just to get the areas of the tables.
You can export the template as JSON; here is the output for page 4:
[{"page":4,"extraction_method":"guess","x1":75.15359512329101,"x2":522.3546908569336,"y1":250.3879679107666,"y2":782.4158937835693,"width":447.2010957336426,"height":532.0279258728027}]