python regex machine-learning ocr tesseract

Receipt reading using TesserOCR

My roommates and I are tired of manually splitting receipts every time we go grocery shopping (especially at Costco), so I want to make a receipt splitter using image recognition.

I'm using tesserocr for Python to convert pictures of the receipt into text, then matching the text with regex and doing the calculations from there. The problem is that Tesseract does a terrible job of converting the image to text. Here is a picture of one of our receipts, and here's the output after calling api.SetImageFile(img) and api.GetUTF8Text():

”Wm-
Belfsville #214j
|0925 Balfimore Rve. (R1.
Belfsvlll B, MD 20705
4P Member 111869052983
E 1952 SNEET8SRTLY 11.79
E 0000165287 CPN/1952 3.80“
E 1952 SNEET&SHTLY 11.79
E 0000165287 CPN/1952 3.80-
87745 ROTISSERIE 4.99 H
1 5597 BLUEBERRIES 6.99
E. 5597 BIUEBERRIES 6.99
E. 979210 CHOC MRNGOS 9.99 H
F‘ 24311 VHR. MUFFIN 7.99
1 1060788 PRETZELCRISP 6.89
87745 ROTISSERIE 4.99 H
- 87745 ROTISSERIE 4.99 H
EZ 71096 RED DEL 7.99
El 1027557 KOREHNNOODLE 8.79
Ei 382861 KS IN CK BST 16.79
[S 91610 FROSTED FLKS 6.79
[3 11422 3 YR CHDR 12_27
[5 46849 SESNDPRKTEND 12.55
SUBTOTRL 13 _
THX 1,33
xu** TOTAL IIIIIIIBEEIHﬂI
xxxxx XXXXXXX4540 CHIP Read
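For what it's worth, item lines like the ones above could be matched with a tolerant pattern along these lines. This is only a sketch: the names ITEM_RE and parse_item are hypothetical, and the layout guesses (a short junk prefix, an item number of 4 to 10 digits, a price whose decimal point may come out as "," or "_", a trailing "-" for coupons) are assumptions drawn from this one receipt.

```python
import re
from decimal import Decimal

# Hypothetical pattern, based only on the sample output above.
ITEM_RE = re.compile(
    r"""^(?:\S{1,3}\s+)?           # optional OCR junk from the left margin
        (?P<sku>\d{4,10})\s+       # item number
        (?P<name>.+?)\s+           # item description, exactly as OCR'd
        (?P<price>\d+[.,_]\d{2})   # price; the decimal point may be misread
        (?P<discount>-)?           # trailing minus sign marks a coupon
        (?:\s+(?P<flag>[A-Z]))?    # optional single-letter flag after the price
        \s*$
    """,
    re.VERBOSE,
)


def parse_item(line):
    """Return (sku, name, price) for an item line, or None if it does not match."""
    m = ITEM_RE.match(line.strip())
    if m is None:
        return None
    # Normalise a misread decimal separator before converting.
    price = Decimal(re.sub(r"[,_]", ".", m.group("price")))
    if m.group("discount"):
        price = -price
    return m.group("sku"), m.group("name"), price
```

Lines for which parse_item returns None (the store header, the subtotal and tax lines, or an item line with stray trailing characters) would still need separate handling or a manual check.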

You can see that the output is kind of hard to work with. It reads "A" as "H" and sometimes reads "E" as "F" or other random stuff. I think I have two options:

1. Somehow train Tesseract to read the receipts better, but I have no previous experience with machine learning. I tried to read Tesseract's training guide, but there are a lot of technicalities I'm not familiar with. I'd imagine the actual process is not difficult, though, since the images I'm reading are very specific.
2. Take multiple pictures of the receipt, use something like Fred's ImageMagick Scripts to put all the pictures through different filters, run every combination of the pictures through Tesseract, and consolidate the results. The problems with this are (a) I'm not sure how the consolidation can be done, since it would be difficult with regex, and (b) I imagine there will still be baseline issues, like reading "A" as "H".
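One way the consolidation in the second option might work is a character-by-character majority vote across the OCR passes, rather than regex. The sketch below is a guess at that idea, not a tested solution: vote_line and consolidate are hypothetical names, and it assumes every pass yields the same lines in the same order, which real photos may not guarantee.

```python
from collections import Counter
from itertools import zip_longest


def vote_line(candidates):
    """Merge several OCR readings of the same line by majority vote."""
    candidates = [c for c in candidates if c]
    if not candidates:
        return ""
    # Keep only readings of the most common length so the columns line up.
    common_len = Counter(len(c) for c in candidates).most_common(1)[0][0]
    aligned = [c for c in candidates if len(c) == common_len]
    # For each column, keep the character most readings agree on.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def consolidate(readings):
    """Merge whole OCR outputs (one string per pass) line by line."""
    per_pass = [r.splitlines() for r in readings]
    merged = (vote_line(lines) for lines in zip_longest(*per_pass, fillvalue=""))
    return "\n".join(merged)
```

Voting like this can only cancel out errors that differ between passes. A misreading that shows up in every pass, such as "A" read as "H", would survive it.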

Can anyone help me with either of these options, or point me toward a way to get this done? Or suggest another approach I can try?

Solution

If you can use ImageMagick and are on a Unix-like system (Linux, Mac OS X, Windows with Cygwin, or the Windows 10 Unix environment), then you could try my bash shell scripts, textdeskew and textcleaner, at http://www.fmwconcepts.com/imagemagick/index.php. For example:

textdeskew input.jpg deskew.png

deskew result

and then

textcleaner -f 25 -o 10 -g -e normalize -s 1 deskew.png deskew_clean.png

deskew and clean result

Or, on any OS, just use ImageMagick's own -deskew and -lat options:

convert input.jpg -deskew 40% input_deskew.png

ImageMagick deskew result

convert input_deskew.png -negate -lat 25x25+10% -negate input_deskew_lat.png

ImageMagick deskew and lat result

Or run them together as:

convert input.jpg -deskew 40% -negate -lat 25x25+10% -negate input_deskew_lat.png
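If you would rather drive this from your Python script, something like the following sketch could run the combined command and then hand the cleaned image to tesserocr. The function names and the default output filename are just placeholders, and it assumes the convert executable is on your PATH (ImageMagick 7 installs may need magick instead).

```python
import subprocess


def build_convert_command(src, dst, deskew="40%", lat="25x25+10%"):
    """Return the combined ImageMagick command above as an argument list."""
    return ["convert", src, "-deskew", deskew,
            "-negate", "-lat", lat, "-negate", dst]


def ocr_receipt(src, cleaned="input_deskew_lat.png"):
    """Clean the image with ImageMagick, then read it with tesserocr."""
    # Imported here so build_convert_command works without tesserocr installed.
    from tesserocr import PyTessBaseAPI

    # check=True raises CalledProcessError if ImageMagick fails.
    subprocess.run(build_convert_command(src, cleaned), check=True)
    with PyTessBaseAPI() as api:
        api.SetImageFile(cleaned)
        return api.GetUTF8Text()
```

Calling ocr_receipt("input.jpg") would then return the recognized text for the cleaned image. The -deskew and -lat values are the ones from the commands above and may need tuning for your photos.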

Do any of those help your OCR?