Search code examples
pythonpython-3.xstringdataframeocr

Data extraction from a string


enter image description here

I have extracted the tabular information from a scanned image(refer above screenshot) and getting the output as a string.

Here is the output

enter image description here

Payment Date: 9/14/2020 Reference: 0000232954                                                                
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86                                                             
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96                                                            
Vendor Number Name Bank Charge Transfer Cost Cd                                                              
1000028351 OFFICE DEPOT INC $0.00                                                          
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
                                                                   
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82

The above output is a string where the data is misaligned, I'm looking for a solution where I can store the data into a Data Frame or Excel Sheet.

The column names from the above image can be interpreted as [Invoice Number, Invoice Date, Voucher ID, Gross Amount, Discounts, Late Charges, Paid Amount]

Looking forward for your help!


Solution

  • You can use regular expressions to parse the text document. For example:

    import re
    import pandas as pd
    
    txt = """
    Payment Date: 9/14/2020 Reference: 0000232954                                                                
    Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
    102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86                                                             
    112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96                                                            
    Vendor Number Name Bank Charge Transfer Cost Cd                                                              
    1000028351 OFFICE DEPOT INC $0.00                                                          
    Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
                                                                       
    0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82
    """
    
    pat_payment_date = re.compile(r"Payment Date:\s*(\S+)")
    pat_reference = re.compile(r"Reference:\s*(\S+)")
    
    pat_items = re.compile(
        r"^(\d+)\s+(\S+)\s+(\d+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)",
        flags=re.M,
    )
    pat_vendor = re.compile(
        r"^Vendor.*?\n^(\d+)\s+(.*?)\s+([$\d,.]+)", flags=re.M | re.S
    )
    pat_last = re.compile(
        r"^Reference.*?\n^(\d+)\s+(\S+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)",
        flags=re.M | re.S,
    )
    
    data = {}
    for row in pat_payment_date.findall(txt):
        data["Payment Date"] = row
    for row in pat_reference.findall(txt):
        data["Reference"] = row
    for row in pat_items.findall(txt):
        data.setdefault("Items", []).append(list(row))
    for row in pat_vendor.findall(txt):
        data["Vendor"] = list(row)
    for row in pat_last.findall(txt):
        data["Total"] = list(row)
    
    df = pd.DataFrame([data]).explode("Items")
    print(df)
    

    Prints:

      Payment Date   Reference                                                              Items                                 Vendor                                                          Total
    0    9/14/2020  0000232954   [102554463001, Jul/062020, 21002450, 699.86, 0.00, 0.00, 699.86]  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
    0    9/14/2020  0000232954  [112942431001, Aug/12/2020, 21002565, 875.96, 0.00, 0,00, 875.96]  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
    

    To create columns from list you can do then:

    cols = [
        "invoice_number",
        "invoice_date",
        "voucher_id",
        "gross_amount",
        "discounts",
        "late_charges",
        "paid_amount",
    ]
    df = pd.concat(
        [
            df,
            df.pop("Items")
            .apply(lambda x: {c: v for v, c in zip(x, cols)})
            .apply(pd.Series),
        ],
        axis=1,
    )
    print(df)
    

    Prints:

      Payment Date   Reference                                 Vendor                                                          Total invoice_number invoice_date voucher_id gross_amount discounts late_charges paid_amount
    0    9/14/2020  0000232954  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]   102554463001   Jul/062020   21002450       699.86      0.00         0.00      699.86
    0    9/14/2020  0000232954  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]   112942431001  Aug/12/2020   21002565       875.96      0.00         0,00      875.96