I have extracted the tabular information from a scanned image(refer above screenshot) and getting the output as a string.
Here is the output
Payment Date: 9/14/2020 Reference: 0000232954
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96
Vendor Number Name Bank Charge Transfer Cost Cd
1000028351 OFFICE DEPOT INC $0.00
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82
The above output is a string where the data is misaligned, I'm looking for a solution where I can store the data into a Data Frame or Excel Sheet.
The column names from the above image can be interpreted as [Invoice Number, Invoice Date, Voucher ID, Gross Amount, Discounts, Late Charges, Paid Amount]
Looking forward for your help!
You can use regular expressions to parse the text document. For example:
import re
import pandas as pd
txt = """
Payment Date: 9/14/2020 Reference: 0000232954
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96
Vendor Number Name Bank Charge Transfer Cost Cd
1000028351 OFFICE DEPOT INC $0.00
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82
"""
pat_payment_date = re.compile(r"Payment Date:\s*(\S+)")
pat_reference = re.compile(r"Reference:\s*(\S+)")
pat_items = re.compile(
r"^(\d+)\s+(\S+)\s+(\d+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)",
flags=re.M,
)
pat_vendor = re.compile(
r"^Vendor.*?\n^(\d+)\s+(.*?)\s+([$\d,.]+)", flags=re.M | re.S
)
pat_last = re.compile(
r"^Reference.*?\n^(\d+)\s+(\S+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)",
flags=re.M | re.S,
)
data = {}
for row in pat_payment_date.findall(txt):
data["Payment Date"] = row
for row in pat_reference.findall(txt):
data["Reference"] = row
for row in pat_items.findall(txt):
data.setdefault("Items", []).append(list(row))
for row in pat_vendor.findall(txt):
data["Vendor"] = list(row)
for row in pat_last.findall(txt):
data["Total"] = list(row)
df = pd.DataFrame([data]).explode("Items")
print(df)
Prints:
Payment Date Reference Items Vendor Total
0 9/14/2020 0000232954 [102554463001, Jul/062020, 21002450, 699.86, 0.00, 0.00, 699.86] [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
0 9/14/2020 0000232954 [112942431001, Aug/12/2020, 21002565, 875.96, 0.00, 0,00, 875.96] [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
To create columns from list you can do then:
cols = [
"invoice_number",
"invoice_date",
"voucher_id",
"gross_amount",
"discounts",
"late_charges",
"paid_amount",
]
df = pd.concat(
[
df,
df.pop("Items")
.apply(lambda x: {c: v for v, c in zip(x, cols)})
.apply(pd.Series),
],
axis=1,
)
print(df)
Prints:
Payment Date Reference Vendor Total invoice_number invoice_date voucher_id gross_amount discounts late_charges paid_amount
0 9/14/2020 0000232954 [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82] 102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86
0 9/14/2020 0000232954 [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82] 112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96