python python-imaging-library tesseract python-tesseract pytesser

Numerical character recognition in Pytesser

I am working on a project that requires me to get prices from a commodity exchange. Unfortunately the exchange has no webservice or other plugin available that allows me to get the prices from the trading screen.

I figured that I could automatically take a screenshot of the prices and split it into one image per price. I then process each image with the Pytesser v0.0.1 library for Tesseract 3.02, combined with Pillow 3.1.0, in Python 2.7. However, the conversion of the image to text (by the image_to_string function) is very poor: in most cases a 0 becomes an o or a 5 becomes an s, and sometimes the errors are random, which makes it difficult to simply replace these characters. I have already enlarged the image and applied anti-aliasing, but the results do not improve. Is there a way to limit the set of characters to only digits and a dot for decimals? And how can the quality of the conversion be improved?

Perhaps my method is too tedious and you guys know a better way to do it? Your help is appreciated :)

Solution

Is there a way to limit the set of characters to only digits and a dot for decimals?

Yes. With the pyslibtesseract package you can set a character whitelist:

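from pyslibtesseract import TesseractConfig, PageSegMode

# Treat each cropped price image as a single line of text
config_line = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE)
# Restrict recognition to digits and the decimal point
config_line.add_variable('tessedit_char_whitelist', '0123456789.')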
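A whitelist only restricts which characters can appear; it does not guarantee that the result is a well-formed price. As a minimal sketch, the recognized text could be validated before it is used. The names `PRICE_PATTERN` and `parse_price` are hypothetical helpers made up for this illustration, and the assumed format is plain digits with an optional decimal part:

```python
import re

# Assumed price format: one or more digits, optionally followed by
# a dot and more digits (e.g. "105" or "12.50").
PRICE_PATTERN = re.compile(r"[0-9]+(?:\.[0-9]+)?\Z")


def parse_price(ocr_text):
    """Return the OCR result as a float, or None if it is not a valid price."""
    # OCR output often carries trailing newlines or spaces
    cleaned = ocr_text.strip()
    if PRICE_PATTERN.match(cleaned):
        return float(cleaned)
    # Reject anything malformed instead of guessing at a correction
    return None
```

Returning `None` for a malformed result lets the caller retry or skip that price rather than store a wrong number.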

And how can the quality of the conversion be improved?

You need to use OpenCV to improve the image quality before passing the image to Tesseract.