
Inconsistent Pytesseract


I have a directory full of images and want to extract a value from one region of each image.

I won't go into how I locate the text within the original image; it's just a convolve function.

Here's an example of it working:

Extracted text (this is actually a numpy array of True/False values saved as an image with matplotlib's imsave(name, image, cmap='gray')):

[image: Extracted Text]

If I now run

    pytesseract.image_to_string(image2)

or

    pytesseract.image_to_string(image2, config="--psm 7")

the result is '3 000 x' as expected.

Here's an example of it failing:

Extracted text (again a numpy array of True/False values saved as an image with matplotlib's imsave(name, image, cmap='gray')):

[image: imageText]

If I now run

    pytesseract.image_to_string(image2)

or

    pytesseract.image_to_string(image2, config="--psm 7")

the result is 'i imol els 4'.

It seems odd to me that two such similar images would give such different results. Are there parameters that help pytesseract, e.g. the expected size of the characters, the format, etc.?

PS: My current workaround is a convolve function that compares each image against a directory of samples I've already read manually (my personal OCR is better, though slower, than pytesseract!). This is adequate, but it would be nice to have an additional level of automation!


Solution

  • I inverted your image and then ran this command:

    tesseract hluZr.png stdout -l eng --oem 3 --psm 6
    1508 x
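
The same idea can be sketched from Python without saving the image to disk first. This is only a rough sketch under some assumptions: that True pixels in the boolean array are the text (so the saved image is light text on a dark background), and that inverting to dark text on a light background is what helps, as it did in the command above. The helper names `to_dark_on_light` and `ocr_mask` are made up for illustration, and the white border is an optional extra that is not part of the command above.

```python
import numpy as np


def to_dark_on_light(mask, text_is_true=True, border=10):
    """Turn a boolean mask into a uint8 image with black text on white.

    Assumption: `mask` is the True/False array from the question, and
    `text_is_true` says whether True marks the text pixels.
    """
    mask = np.asarray(mask, dtype=bool)
    text = mask if text_is_true else ~mask
    # Text pixels become 0 (black), background becomes 255 (white).
    img = np.where(text, 0, 255).astype(np.uint8)
    # Optional white margin so characters do not touch the image edge.
    return np.pad(img, border, mode="constant", constant_values=255)


def ocr_mask(mask, config="--oem 3 --psm 6", **kwargs):
    """Run Tesseract on the inverted mask with the flags used above."""
    # Imported lazily so to_dark_on_light works without Tesseract installed.
    import pytesseract

    img = to_dark_on_light(mask, **kwargs)
    return pytesseract.image_to_string(img, lang="eng", config=config).strip()
```

With the array from the question this would be called as `ocr_mask(image2)`. If the text pixels turn out to be False instead, pass `text_is_true=False`.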