Convert pytesseract string output to pandas df


I have been given receipts from Subway detailing sales, workers, etc. throughout the day and need to extract the data for a management class.

I took pictures of the receipts and processed them with pytesseract into a string separated by \n, but now I don't know how to use pd.read_csv and StringIO to transform it into a dataframe. I don't know if this is the best way to go about it. I may also need to edit the image using cv2 so that it processes better.

import numpy as np
import pytesseract
from PIL import Image
import pandas as pd

path = 'C:\\attachments\\'

monday = pytesseract.image_to_string(Image.open(path+'file1-1.jpeg'),lang='eng')

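from io import StringIO  # Python 3; the separate StringIO module exists only in Python 2
mon = pd.read_csv(StringIO(monday), sep=r'\s+')  # split on runs of whitespace; '\n' is already the default line terminator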
print(mon)

This is part of what the variable monday currently contains:

"\nTIME HOURS :\nPERIOD SALES UNITS WORKED PROD SPLH\nZhan emmoo «Ct (iti ;:t‘«é‘«‘i CSD\n3A-4A $0.00 0 0 0 $0.00\n44-54 =: $0.00 SssOO 0 0 $0.00\n5A-6A $0.00 0 0 0 $0.00\nbA-7A $0.00 0 0 0 $0.00\n7A-BA =s«$0.00-Sss«OOs«*O0.80 0 $0.00\nBA-9A 60,00 . Qge2.00 0 $0.00\nQA-10A $33.68 6 2,00 3.00 $16.84\n104-114 $61.07 9 2.13 4.23 $28.67\n11A-12P$238.82 33 5,00 6.60 $47.76"

It should look like this as a dataframe:

Period Sales Units Worked Prod SPLH
3A-4A  $0.00  0      0     0   $0.00
bA-7A  $0.00  0      0     0   $0.00

Solution

  • You can have pytesseract return its results directly as a pandas DataFrame:

    monday = pytesseract.image_to_data(Image.open(path+'file1-1.jpeg'),lang='eng', output_type='data.frame')
    

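    Now monday is a DataFrame, but it still needs processing: image_to_data returns one row for every element at each level of the layout hierarchy (page, block, paragraph, line, word), and the recognized text appears only on the word rows. The columns include level, page_num, block_num, par_num, line_num, word_num, conf and text. Inspect the output and decide how you want to organize it.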
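One possible way to organize it is to regroup the word rows into printed lines and keep only the lines that split cleanly into the six expected columns. The sketch below is not a definitive implementation: `words_to_table`, `_line` and `COLUMNS` are hypothetical names, the column labels are taken from the desired output in the question, and the demo frame is hand-built to mimic the shape of image_to_data output so the example runs without Tesseract installed. It assumes every valid receipt row has exactly six whitespace-separated tokens, so rows mangled by OCR are dropped rather than repaired.

```python
import pandas as pd

# Column labels taken from the desired output in the question (an assumption).
COLUMNS = ["Period", "Sales", "Units", "Worked", "Prod", "SPLH"]
LINE_KEYS = ["page_num", "block_num", "par_num", "line_num"]


def words_to_table(data, columns=COLUMNS):
    """Hypothetical helper: one output row per OCR line with len(columns) tokens."""
    # Level 5 rows are single words; page/block/paragraph/line rows carry no text.
    words = data[data["level"] == 5].dropna(subset=["text"]).copy()
    words["text"] = words["text"].astype(str).str.strip()
    words = words[words["text"] != ""]
    # Words on the same printed line share these four numbers.
    lines = (
        words.sort_values(LINE_KEYS + ["word_num"])
        .groupby(LINE_KEYS)["text"]
        .agg(list)
    )
    header = [c.upper() for c in columns]
    # Keep lines with the expected token count, skipping the header line itself.
    rows = [
        tokens for tokens in lines
        if len(tokens) == len(columns) and [t.upper() for t in tokens] != header
    ]
    return pd.DataFrame(rows, columns=columns)


def _line(line_num, tokens):
    # Demo only: one level-4 line row followed by its level-5 word rows.
    base = {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": line_num}
    rows = [dict(base, level=4, word_num=0, text=None)]
    for i, tok in enumerate(tokens, start=1):
        rows.append(dict(base, level=5, word_num=i, text=tok))
    return rows


# Hand-built stand-in for the image_to_data result, using lines from the question.
demo = pd.DataFrame(
    _line(1, "PERIOD SALES UNITS WORKED PROD SPLH".split())
    + _line(2, "3A-4A $0.00 0 0 0 $0.00".split())
    + _line(3, "7A-BA =s«$0.00-Sss«OOs«*O0.80 0 $0.00".split())  # OCR noise, dropped
    + _line(4, "5A-6A $0.00 0 0 0 $0.00".split())
)

table = words_to_table(demo)
print(table)
```

With the real data you would call `words_to_table(monday)` instead of using `demo`. Filtering on the conf column, or cleaning the image with cv2 before OCR, may reduce the number of rows that get discarded.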