Data extraction from a string

I have extracted the tabular information from a scanned image(refer above screenshot) and getting the output as a string.

Here is the output

Payment Date: 9/14/2020 Reference: 0000232954                                                                
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86                                                             
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96                                                            
Vendor Number Name Bank Charge Transfer Cost Cd                                                              
1000028351 OFFICE DEPOT INC $0.00                                                          
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
                                                                   
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82

The above output is a string where the data is misaligned, I'm looking for a solution where I can store the data into a Data Frame or Excel Sheet.

The column names from the above image can be interpreted as [Invoice Number, Invoice Date, Voucher ID, Gross Amount, Discounts, Late Charges, Paid Amount]

Looking forward for your help!

Solution

You can use regular expressions to parse the text document. For example:

import re
import pandas as pd

txt = """
Payment Date: 9/14/2020 Reference: 0000232954                                                                
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86                                                             
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96                                                            
Vendor Number Name Bank Charge Transfer Cost Cd                                                              
1000028351 OFFICE DEPOT INC $0.00                                                          
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
                                                                   
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82
"""

pat_payment_date = re.compile(r"Payment Date:\s*(\S+)")
pat_reference = re.compile(r"Reference:\s*(\S+)")

pat_items = re.compile(
    r"^(\d+)\s+(\S+)\s+(\d+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)",
    flags=re.M,
)
pat_vendor = re.compile(
    r"^Vendor.*?\n^(\d+)\s+(.*?)\s+([$\d,.]+)", flags=re.M | re.S
)
pat_last = re.compile(
    r"^Reference.*?\n^(\d+)\s+(\S+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)",
    flags=re.M | re.S,
)

data = {}
for row in pat_payment_date.findall(txt):
    data["Payment Date"] = row
for row in pat_reference.findall(txt):
    data["Reference"] = row
for row in pat_items.findall(txt):
    data.setdefault("Items", []).append(list(row))
for row in pat_vendor.findall(txt):
    data["Vendor"] = list(row)
for row in pat_last.findall(txt):
    data["Total"] = list(row)

df = pd.DataFrame([data]).explode("Items")
print(df)

Prints:

  Payment Date   Reference                                                              Items                                 Vendor                                                          Total
0    9/14/2020  0000232954   [102554463001, Jul/062020, 21002450, 699.86, 0.00, 0.00, 699.86]  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
0    9/14/2020  0000232954  [112942431001, Aug/12/2020, 21002565, 875.96, 0.00, 0,00, 875.96]  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]

To create columns from list you can do then:

cols = [
    "invoice_number",
    "invoice_date",
    "voucher_id",
    "gross_amount",
    "discounts",
    "late_charges",
    "paid_amount",
]
df = pd.concat(
    [
        df,
        df.pop("Items")
        .apply(lambda x: {c: v for v, c in zip(x, cols)})
        .apply(pd.Series),
    ],
    axis=1,
)
print(df)

Prints:

  Payment Date   Reference                                 Vendor                                                          Total invoice_number invoice_date voucher_id gross_amount discounts late_charges paid_amount
0    9/14/2020  0000232954  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]   102554463001   Jul/062020   21002450       699.86      0.00         0.00      699.86
0    9/14/2020  0000232954  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]   112942431001  Aug/12/2020   21002565       875.96      0.00         0,00      875.96