Search code examples
python-3.xpandasdataframepython-camelotpdf-extraction

How to use Camelot-py to split rows when text exist on a specific column


I am trying to extract table information from pdf using Camelot-py library. Initially using stream function like this:

import camelot
tables = camelot.read_pdf('sample.pdf', flavor='stream', pages='1', columns=['110,400'], split_text=True, row_tol=10)
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df

However could not get the desired outcome, even after adjusting the columns value. Then I switched to lattice flavor. It can now determine the column accurately, however due to the nature that the pdf source does not separate rows using lines, the whole table content are extracted on one row.

Below using lattice:

import camelot
tables = camelot.read_pdf('sample_camelot_extract.pdf', flavor='lattice', pages='1')
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df

source file snapshot

The logic that I want to implement is that for each new text that exists on the first column (FIG ITEM), it should be the start of the new row.

Have tried both flavors but not sure which is the best approach.

Link for original file here:

Logic intended

Thank you.


Solution

  • You could try using pdfplumber - it allows you to customize all of its table extraction settings.

    For example - changing just the default horizontal strategy to text produces:

    table = page.extract_table(table_settings={"horizontal_strategy": "text"})
    
    [['FIG', '', '', 'EFFECT', 'UNITS'],
     ['ITEM', 'PART NUMBER', '1234567 NOMENCLATURE', 'FROM TO', 'PER\nASSY'],
     ['', '', '', '', ''],
     ['1', '', '', '', ''],
     ['', '', 'SYSTEM INSTL-AIR DISTR MIX', '', ''],
     ['', '', 'BAY (MAIN AIR', '', ''],
     ['', '', 'DISTRIBUTION ONLY)', '', ''],
    

    You could play around with more settings to see if it's possible to extract the whole table the way you intend.

    From here though - you could manually clean up and extract the column rows:

    >>> df = pd.DataFrame(table)
    >>> (df.iloc[0] + " " + df.iloc[1]).str.replace("\n", " ").str.strip()
    0                FIG ITEM
    1             PART NUMBER
    2    1234567 NOMENCLATURE
    3          EFFECT FROM TO
    4          UNITS PER ASSY
    dtype: object
    
    df.columns = (df.iloc[0] + " " + df.iloc[1]).str.replace("\n", " ").str.strip()
    df = df.tail(-3)
    

    You could then forward fill the FIG ITEM column and group on that - allowing you to combine the items.

    df.groupby(df["FIG ITEM"].replace("", float("nan")).ffill()).agg({
       "PART NUMBER": "first",
       "1234567 NOMENCLATURE": "\n".join,
       "UNITS PER ASSY": "first",
    })
    
                PART NUMBER                               1234567 NOMENCLATURE UNITS PER ASSY
    FIG ITEM                                                                                 
    - 1        M0DREF452754  SYSTEM INSTL-AIR DISTR MIX\nBAY (MAIN AIR\nDIS...             RF
    1                        \nSYSTEM INSTL-AIR DISTR MIX\nBAY (MAIN AIR\nD...               
    10          BACS12GU3K8                                             .SCREW             12
    15          BACS12GU3K9                                             .SCREW             18
    20         BACB30NM3K15                                              .BOLT             15
    27         BACB30NM3K17                                              .BOLT              2
    28         BACB30NM3K20                                              .BOLT              1
    30        NAS1149D0332J                                            .WASHER             60
    35          BACW10P44AL                                            .WASHER              2
    40              PLH53CD  .NUT-\nSUPPLIER CODE:\nVF0224\nSPECIFICATION N...              2
    45             SLT8LHC6  .STRAP-\nSUPPLIER CODE:\nV06383\nTRUE PART NUM...              7
    5           BACS12GU3K7                                             .SCREW             12