python image-processing ocr tesseract python-tesseract

image to text - remove non-ascii chars in python 2.7

I am using pytesser to OCR a small image and get a string from it:

from PIL import Image
from pytesser import image_to_string

image = Image.open(ImagePath)
text = image_to_string(image)
print text

However, pytesser sometimes recognizes and returns non-ASCII characters. The problem occurs when I try to print what was just recognized: in Python 2.7 (which is what I am using), the program crashes.

Is there some way to make pytesser not return any non-ASCII characters? Perhaps there is a setting you can change in Tesseract OCR itself?

Or, is there some way to test a string for non-ascii characters (without crashing the program) and then just not print that line?

Some would suggest using Python 3.4, but from my research it looks like pytesser does not work with it; see "Pytesser in Python 3.4: name 'image_to_string' is not defined?"

Solution

I would go with Unidecode. This library converts non-ASCII characters to their most similar ASCII representation (for example, an accented "e" becomes a plain "e").

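from PIL import Image
from pytesser import image_to_string
from unidecode import unidecode  # import the function, not just the module

image = Image.open(ImagePath)
# pytesser returns a byte string in Python 2; Tesseract writes UTF-8, so decode to unicode first
text = image_to_string(image).decode('utf-8')
print unidecode(text)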

Since the result of unidecode() contains only ASCII characters, printing it should no longer crash.
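
If you would rather skip the offending lines than transliterate them, as the question also asks, here is a minimal sketch using only the standard library. The helper names is_ascii and ascii_lines are made up for this example (they are not part of pytesser or Tesseract), and the sample string stands in for real OCR output. It is written to run on both Python 2.7 and Python 3.

```python
def is_ascii(text):
    """Return True if text contains only ASCII characters.

    Hypothetical helper for illustration; accepts byte strings
    (the plain str type in Python 2) as well as unicode strings.
    """
    if isinstance(text, bytes):
        # Byte string: try to decode it as ASCII.
        try:
            text.decode('ascii')
        except UnicodeDecodeError:
            return False
        return True
    # Unicode string: try to encode it as ASCII.
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


def ascii_lines(text):
    """Yield only the lines of text that are pure ASCII."""
    for line in text.splitlines():
        if is_ascii(line):
            yield line


# Stand-in for the string returned by image_to_string(image).
sample = u"Total: 42\nCaf\u00e9 menu\nThank you"
for line in ascii_lines(sample):
    print(line)
```

With the sample above, only "Total: 42" and "Thank you" are printed; the line containing the accented character is silently dropped. The check catches the encoding error instead of letting it propagate, so testing a string never crashes the program.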