python regex vision ocr

Extract data from txt with regex


I just used Google Vision API to convert a pdf receipt to a txt file. Now I would like to extract 4 specific fields and save those in a new txt file.

I highlighted 2 examples of the items I would like to extract:

Yellow: product ID; green: quantity; blue: unit price; red: product description.

Here is a piece of the text file:

['Waiting for the operation to finish.\n', 'Output files:\n', 'receipts/factura_lider.txtoutput-1-to-1.json\n', 'Full text:\n', '\n', 'ADMIN. DE SUPERMERCADOS HIPER\n', 'LIMITADA\n', '76.134.941-4\n', 'Hiper\n', 'LIDER\n', 'GRANDES ESTABLECIMIENTOS-VENTA DE\n', 'ALIM.\n', 'BOLETA ELECTRÓNICA N° : 1680178292\n', 'LOCAL:\n', '0682\n', 'CAJA:\n', '020\n', 'CAJERO:\n', '163\n', 'FECHA EMISION:\n', '20-10-2020\n', 'HORA:\n', '09:07\n', 'TRAN. Nº:\n', '0018\n', 'CANT.\n', 'PRECIO UNIT.\n', 'DESC. ARTICULO\n', 'VALOR\n', '2.150\n', '4.511\n', '1.690\n', '5.990\n', '1.990\n', '1.190\n', '309\n', '2.490\n', '4.290\n', '2.650\n', '2.290\n', '4.500\n', '3.840\n', '1.416\n', 'CODIGO: 07803473002662\n', '1.0x 2.150 ID PAN BLA G\n', 'CODIGO: 02069600000009\n', '0.515x 8.759 PAVO PECHUGA\n', 'CODIGO: 00078742086811\n', '1.0X 1.690 MARGARIN REG\n', 'CODIGO: 07613036150521\n', '1.0X 5.990 BUEN DIA 1.1\n', 'CODIGO: 07804115001838\n', '1.0x 1.990 AZ-MOL TR PA\n', 'CODIGO: 07802920801704\n', '1.0 1.190 YOGHURT DAMA\n', 'CODIGO: 07804646490194\n', '1.0x 309 CILANTRO BOL\n', 'CODIGO: 00614143030932\n', '1.0x 2.490 FRUTOS BOS\n', 'CODIGO: 07804100103158\n', '1.0x 4.290 PACK HUEVO M\n', 'CODIGO: 07801930000 602\n', '1.0x 2.650 PANCETAPF\n', 'CODIGO: 07804152000283\n', '1.0x 2.290 NARANJA 1.5\n', 'CODIGO: 07805000183080\n', '1.0X 4.500 DET.LQ.DPLIR\n', 'CODIGO: 02164730000001\n', '2.415X 1.590 POLLO ENTERO\n', 'CODIGO: 02000140000005\n', '1.43% 990 PLATANO\n', 'CODIGO: 07804653341021\n', '1.0X 1.000 PHX6\n', 'CODIGO: 07802655002230\n', '1.0x 830 HARINA S/POL\n',


Solution

    import re
    
    text = '''['Waiting for the operation to finish.\n', 'Output files:\n', 'receipts/factura_lider.txtoutput-1-to-1.json\n', 'Full text:\n', '\n', 'ADMIN. DE SUPERMERCADOS HIPER\n', 'LIMITADA\n', '76.134.941-4\n', 'Hiper\n', 'LIDER\n', 'GRANDES ESTABLECIMIENTOS-VENTA DE\n', 'ALIM.\n', 'BOLETA ELECTRÓNICA N° : 1680178292\n', 'LOCAL:\n', '0682\n', 'CAJA:\n', '020\n', 'CAJERO:\n', '163\n', 'FECHA EMISION:\n', '20-10-2020\n', 'HORA:\n', '09:07\n', 'TRAN. Nº:\n', '0018\n', 'CANT.\n', 'PRECIO UNIT.\n', 'DESC. ARTICULO\n', 'VALOR\n', '2.150\n', '4.511\n', '1.690\n', '5.990\n', '1.990\n', '1.190\n', '309\n', '2.490\n', '4.290\n', '2.650\n', '2.290\n', '4.500\n', '3.840\n', '1.416\n', 'CODIGO: 07803473002662\n', '1.0x 2.150 ID PAN BLA G\n', 'CODIGO: 02069600000009\n', '0.515x 8.759 PAVO PECHUGA\n', 'CODIGO: 00078742086811\n', '1.0X 1.690 MARGARIN REG\n', 'CODIGO: 07613036150521\n', '1.0X 5.990 BUEN DIA 1.1\n', 'CODIGO: 07804115001838\n', '1.0x 1.990 AZ-MOL TR PA\n', 'CODIGO: 07802920801704\n', '1.0 1.190 YOGHURT DAMA\n', 'CODIGO: 07804646490194\n', '1.0x 309 CILANTRO BOL\n', 'CODIGO: 00614143030932\n', '1.0x 2.490 FRUTOS BOS\n', 'CODIGO: 07804100103158\n', '1.0x 4.290 PACK HUEVO M\n', 'CODIGO: 07801930000 602\n', '1.0x 2.650 PANCETAPF\n', 'CODIGO: 07804152000283\n', '1.0x 2.290 NARANJA 1.5\n', 'CODIGO: 07805000183080\n', '1.0X 4.500 DET.LQ.DPLIR\n', 'CODIGO: 02164730000001\n', '2.415X 1.590 POLLO ENTERO\n', 'CODIGO: 02000140000005\n', '1.43% 990 PLATANO\n', 'CODIGO: 07804653341021\n', '1.0X 1.000 PHX6\n', 'CODIGO: 07802655002230\n', '1.0x 830 HARINA S/POL\n'
    '''
    
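    # The triple-quoted string above is not raw, so every \n in it is a real newline.
    # Each product spans two list items:
    #   'CODIGO: <product ID>\n', '<quantity> <unit price> <description>\n'
    pattern = re.compile(r"CODIGO:\s*(.*?)\n', '(.*?)\n")

    # re.search would stop at the first product; finditer walks through all of them.
    for product in pattern.finditer(text):
        product_ID = product.group(1).strip()
        q_up_pd_str = product.group(2).split()
        quantity = q_up_pd_str[0]        # keeps the OCR'd suffix, e.g. '1.0x'
        unit_price = q_up_pd_str[1]
        product_description = ' '.join(q_up_pd_str[2:])
        print(product_ID, quantity, unit_price, product_description)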
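
The question also asks to save the fields to a new txt file. Here is a rough sketch of how that might look. It assumes the OCR output is available as a list of lines (for example from `readlines()`) rather than as one big string. The names `parse_items` and `save_items`, the `;` separator and the output file name are illustrative choices, and the line format is inferred from the sample receipt only.

```python
import re

# Hypothetical patterns, inferred from the sample receipt above.
CODE_RE = re.compile(r"CODIGO:\s*(.+)")
# quantity, an optional multiplication sign as OCR'd (x, X or a misread %),
# unit price, then the description
ITEM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*[xX%]?\s+(\S+)\s+(.+)")


def parse_items(lines):
    """Return (product_id, quantity, unit_price, description) tuples."""
    items = []
    product_id = None
    for line in lines:
        line = line.strip()
        code = CODE_RE.match(line)
        if code:
            # OCR sometimes splits the code with a space ("07801930000 602")
            product_id = code.group(1).replace(" ", "")
            continue
        item = ITEM_RE.match(line)
        # Only accept an item line that directly follows a CODIGO line
        if product_id is not None and item:
            items.append((product_id, *item.groups()))
        product_id = None
    return items


def save_items(items, path):
    """Write one product per line, fields separated by ';'."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(";".join(item) + "\n")


sample = [
    "VALOR\n",
    "1.416\n",
    "CODIGO: 07803473002662\n",
    "1.0x 2.150 ID PAN BLA G\n",
    "CODIGO: 07801930000 602\n",
    "1.0x 2.650 PANCETAPF\n",
    "CODIGO: 07802920801704\n",
    "1.0 1.190 YOGHURT DAMA\n",
    "CODIGO: 02000140000005\n",
    "1.43% 990 PLATANO\n",
]
items = parse_items(sample)
for item in items:
    print(item)
```

Calling `save_items(items, "products.txt")` would then write the result to disk. All values are kept as strings, because the receipt appears to use `.` both as a decimal point (quantities) and as a thousands separator (prices), so converting them to numbers needs a decision per field.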