Search code examples
pythonpandaspdfplumber

pdfplumber: Merging multi-line cells into one


So I'm struggling a little bit on one particular aspect of my data extraction script, i've successfully got the script to extract the data but i'm not sure how to deal with merging multi-line cells. For example: I have some data that is spread across two rows (highlighted)

enter image description here

but i want them merged in to one row in to the appropriate column. so instead of:

AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L.

it reads AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L.

I want this to apply across all columns.

Here is the script that i've written so far:

import pdfplumber
import pandas as pd

pdf_path = "C:/Users/xxxx/Downloads/relev-aronefs-11-12-2023.pdf"
all_rows = []
explicit_vertical_lines = [45, 100, 250, 410, 480, 640, 780]

with pdfplumber.open(pdf_path) as pdf:
    for page_num in range(len(pdf.pages)):
        page = pdf.pages[page_num]

        if page_num == 0:
            top_region = 50 
        else:  # For all other pages
            top_region = 0

        cropped_page = page.crop((0, top_region, page.width, page.height))

        table = cropped_page.extract_table({
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": explicit_vertical_lines,
            "horizontal_strategy": "text",
        })

        if table:
            if page_num == 0:
                all_rows += table[1:] 
            else:
                all_rows += table

df = pd.DataFrame(all_rows)

df.columns = ['Immat', 'Constructeur', 'Type d’aéronef', 'SN aéronef', 'Propriétaire', 'Exploitant']

output_csv_path = "C:/Users/xxxx/Downloads/extracted_table_data.csv"
df.to_csv(output_csv_path, index=False)

print(df.head())

If you want to download the pdf i used then its available here:

https://dac.gouvernement.lu/content/dam/gouv_dac/documents/navigabilit%C3%A9/releve-aeronefs/2023/relev-aronefs-11-12-2023.pdf

Any help/guidance would be appreciated.


Solution

  • If you can't avoid the empty rows upfront, here is a quick solution :

    out = (
        df
            # to get rid of the duplicated headers and footers
            .query("Immat != 'Immat' and ~Exploitant.str.contains('Page')")
            .pipe( # here we ffill the Immat, then groupby & agg/join
                lambda x: x.groupby(
                    x["Immat"].replace(
                        {"": None}).ffill(), sort=False)
                    .agg(lambda s: " ".join(s).strip()) # << we join the rows here
                 )
            # to make a default index
            .reset_index(drop=True)
    )
    

    Output (of the highlighted entry) :

    Immat Constructeur Type d’aéronef SN aéronef Propriétaire Exploitant
    2 LX-AIB THE NEW PIPER AIRCRAFT, INC PA28-161 Cadet 28-41086 AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L. AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L.

    Ouptut (truncated):

    print(out)
    
          Immat Constructeur Type d’aéronef SN aéronef Propriétaire   Exploitant
    0    LX-ABC  BOMBARDI...  BD-700-1...         9213  DMH AVIA...  GLOBAL J...
    1    LX-AIA  THE NEW ...        PA18C      18-1011  PIPER CL...  PIPER CL...
    2    LX-AIB  THE NEW ...  PA28-161...     28-41086  AÉRO-SPO...  AÉRO-SPO...
    3    LX-AIC  REIMS AV...        F172L         0852  AÉRO-SPO...  AÉRO-SPO...
    4    LX-AID  REIMS AV...        F172N         1972  AÉRO-SPO...  AÉRO-SPO...
    ..      ...          ...          ...          ...          ...          ...
    267  LX-XMR  STORCH A...  Slepcev ...          136  PROPRIÉT...  EXPLOITA...
    268  LX-YCV  BOEING C...    B747-4R7F        35805  CARGOLUX...  CARGOLUX...
    269  LX-ZAP  PILATUS ...        PC-24          315  ZAPLEX S.A.  JETFLY A...
    270  LX-ZEN  CIRRUS D...         SR22         2972  PROPRIÉT...  EXPLOITA...
    271  LX-ZIO  BOMBARDI...  BD-700-1...         9369  ER FRANZ...  GLOBAL J...
    
    [272 rows x 6 columns]