
Extracting data from PDFs into CSV


I would like to extract all the data, available on pages 4 to 605 of this PDF, into a CSV file. Some people kindly suggested that I use pypdf.

I don't know how to use it. The structure of the PDF is complicated: sometimes some columns are empty, and other values are not standardized (e.g. in the "VALOR" column, the same number can be written "1450", "1.450", "1 4 5 0", etc.).

I hope you can help me! Thank you in advance for your help.

P.S.: I'm still very inexperienced in Python, so please excuse any errors in terminology and/or anything else.


Solution

  • What you need

    Indeed, pypdf is good at extracting text from PDFs. From its documentation:

    pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

    But in your case, tabula is what you need:

    tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.

    Installation:

    You can install tabula using pip:

    pip install tabula-py
    

    For faster execution, install it with jpype:

    pip install tabula-py[jpype]
    

    How to use tabula.read_pdf:

    tabula.read_pdf is the function we are going to use to extract tables from the PDF file. We are interested in the following parameters:

    • input_path: the input PDF file
    • pages: the pages to extract from ("4-605" extracts all tables from pages 4 to 605)
    • pandas_options: options passed to pandas
    • area: to be more efficient, we specify the area of the page that contains the table, in the format (top, left, bottom, right)
    • columns: list of X coordinates of the column boundaries

    Let's do some measurements!

    Measurement, the fun part!

    The picture is a screenshot of page 5 of the PDF; all other pages are similar to this page (except page 4).

    What we are interested in is the green area that contains the rows we need, in this case:

    area=[145, 70, 777, 525]
    

    For page 4, the top is different:

    area_p4=[250, 70, 777, 525]
    


    We also need the X coordinates of the column boundaries to avoid misplacing column values:

    X=[120, 315, 365, 383, 405, 419, 444, 464, 485, 500, 530]
    

    Note: the red region should be part of the green area, but tabula has difficulty handling it, as multi-line column names are treated as separate rows.

    We can set the names using pandas options:

    column_names = [
        "MARCA", "MODELO -TIPO", "PERIODOCOMERCIAL",
        "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf",
        "CO2 g/km", "cv", "2017 VALOR Euros"
    ]
    
    pandas_options = {
        "header": 0,
        "names":column_names
    }
    

    You can name the columns in a better way!
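
    To see what these options do, here is a small standalone sketch using pandas.read_csv, which accepts the same header/names arguments that tabula forwards to pandas (the sample data is made up):

    import io
    import pandas as pd
    
    # The PDF's multi-line headers come out garbled, so we treat the first
    # extracted row as a header to discard ("header": 0) and supply our own
    # column names via "names".
    raw = io.StringIO("bad1,bad2\nSEAT,12000\nFIAT,9500\n")
    df = pd.read_csv(raw, header=0, names=["MARCA", "2017 VALOR Euros"])
    print(df.columns.tolist())  # ['MARCA', '2017 VALOR Euros']
    

    The garbled header row is dropped and the data rows keep our clean names.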

    Coding, the easy part!

    Most of the work is done; all that remains is to extract the tables from the PDF:

    from tabula import read_pdf
    
    with open("file.pdf", "rb") as file:
        # extract tables from pages 5 to 605
        dfs = read_pdf(
            file,
            pages="5-605",  # pages to extract tables from
            area=area,
            columns=X,
            pandas_options=pandas_options,
        )
    
        # rewind the file object before reading it a second time
        file.seek(0)
    
        # extract the table on page 4 (its top coordinate is different)
        df_4 = read_pdf(
            file,
            pages=4,
            area=area_p4,
            columns=X,
            pandas_options=pandas_options,
        )
    

    Then we need to concatenate all the DataFrames into one using pd.concat:

    # add page 4 table to the other tables
    dfs.extend(df_4)
    
    # concat all dfs to get one table
    data = pd.concat(
        dfs,
        axis=0,
        ignore_index=True
    )
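
    This works because read_pdf returns a list of DataFrames (roughly one per detected table), so extend merges the two lists and pd.concat stacks them. A toy illustration of the same pattern, with made-up one-row frames standing in for pages:

    import pandas as pd
    
    # Small frames standing in for tables extracted from different pages
    dfs = [pd.DataFrame({"MARCA": ["SEAT"]}), pd.DataFrame({"MARCA": ["FIAT"]})]
    df_4 = [pd.DataFrame({"MARCA": ["OPEL"]})]
    
    # add page 4's table to the others, then stack them into one frame
    dfs.extend(df_4)
    data = pd.concat(dfs, axis=0, ignore_index=True)
    print(data["MARCA"].tolist())  # ['SEAT', 'FIAT', 'OPEL']
    

    ignore_index=True gives the combined table a fresh 0..n-1 index instead of repeating each page's row numbers.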
    

    Now we can check some sample output:

    data.sample(5)
    
    | index | MARCA | MODELO -TIPO | PERIODOCOMERCIAL | C.C. | Nº de cilind. | G/D | P kW | cvf | CO2 g/km | cv | 2017 VALOR Euros |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | 193 | ALFA ROMEO | ALFA 159 1.9 JTD ti 150 | 2008-2012 | 1910.0 | 4.0 | D | 110 | 12,96 | 157.0 | 150.0 | 27300.0 |
    | 172 | ALFA ROMEO | ALFA 159 1.9 JTS Selective | 2008-2012 | 1859.0 | 4.0 | G | 118 | 12,75 | 205.0 | 161.0 | 24300.0 |
    | 200 | ALFA ROMEO | ALFA 159 2.0 JTDM Elegante ECO | 2008-2012 | 1956.0 | 4.0 | D | 125 | 13,14 | 136.0 | 170.0 | 24000.0 |
    | 164 | ALFA ROMEO | ALFA 156 Sportwagon 2.5 V6 Distinctive Q-System | 2003-2006 | 2492.0 | 6.0 | G | 141 | 17,88 | NaN | 192.0 | 25100.0 |
    | 102 | ALFA ROMEO | ALFA 156 1.6 TS | 1997-2002 | 1598.0 | 4.0 | G | 88 | 11,64 | NaN | 120.0 | 13500.0 |
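
    The question also mentioned that values in the VALOR column are not homogeneous ("1450", "1.450", "1 4 5 0"). One way to normalize them, and finally write the CSV the question asked for, is a sketch like this; the tiny sample frame is made up, and it assumes "." is a thousands separator rather than a decimal point:

    import pandas as pd
    
    # Made-up sample standing in for the extracted table
    data = pd.DataFrame({"2017 VALOR Euros": ["1450", "1.450", "1 4 5 0"]})
    
    col = "2017 VALOR Euros"
    # Strip everything that is not a digit, then convert to integers
    data[col] = (
        data[col]
        .astype(str)
        .str.replace(r"\D", "", regex=True)
        .astype(int)
    )
    print(data[col].tolist())  # [1450, 1450, 1450]
    
    # Write the cleaned table to CSV
    data.to_csv("output.csv", index=False)
    

    All three spellings collapse to the same integer before the table is written out.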

    Finally, you can find the full code in this colab notebook

    Edit:

    Tabula is also available as a friendly web app designed to run locally on your machine.

    You can select tables manually or automatically; it has a feature to auto-detect tables using the Hough transform:

    Next, we detect the boundaries of the table rows. If the table contains ruling lines to separate rows, we use their position to generate the boundaries (top and bottom) of each row. The lines are detected with a computer vision technique called Hough transform, as implemented in the OpenCV library.

    You can use it to extract the data, or to get the table areas:


    You can export the template as JSON; here is the output for page 4:

    [{"page":4,"extraction_method":"guess","x1":75.15359512329101,"x2":522.3546908569336,"y1":250.3879679107666,"y2":782.4158937835693,"width":447.2010957336426,"height":532.0279258728027}]