Tags: python, automation, python-tesseract

How to read numbers on screen efficiently (pytesseract)?


I'm trying to read numbers on the screen with pytesseract. It works, but it is slow and the results are poor. For example, with this image:

base image

I can make this thresholded image:

thresholded image

and it reads 5852 instead of 585, which is understandable, but with different thresholding it can be much worse. For example, it can read 1 000 000 as 1 aaa eee, or 585 as 5385r (yes, it even adds characters for no apparent reason).

Is there any way to force pytesseract to read only numbers, or is there something that works better than pytesseract?

My code:

from PIL import Image
from pytesseract import pytesseract as pyt
import test
pyt.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'

def tti2(location) :
    image_file = location
    im = Image.open(image_file)
    text = pyt.image_to_string(im)
    print(text)
    for character in "abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ*^&\n" :
        text = text.replace(character, "")
    return text

test.th("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TEST.png")
print(tti2("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TESTbis.png"))

Code of "test" (it does the thresholding):

import cv2
from PIL import Image

def th(Path) :
    img = cv2.imread(Path)
    # If your image is not already grayscale :
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    threshold = 60 # to be determined
    _, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
    pil_img = Image.fromarray(img_binarized)
    Path = Path.replace(".png","")
    pil_img.save(Path+"bis.png")

Solution

  • You can force pytesseract to read only numbers by passing the tessedit_char_whitelist config option with only digits. You can also try to improve the results by following the Tesseract documentation: Tesseract - Improving the quality of the output

    I also suggest that you:

    • Use white for the background and black for the characters.
    • Select the appropriate Tesseract page segmentation mode (--psm). Here I use mode 7, which treats the image as a single text line.
    • Use the tessedit_char_whitelist config option to specify only the characters you are searching for.

    With that in mind, here is the code:

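    import cv2
    import pytesseract

    # Path to the Tesseract executable (default Windows install location)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

    originalImage = cv2.imread('1.png')
    grayImage = cv2.cvtColor(originalImage, cv2.COLOR_BGR2GRAY)
    # Inverted binary threshold: pixels brighter than 127 become black, the rest white
    (_, blackAndWhiteImage) = cv2.threshold(grayImage, 127, 255, cv2.THRESH_BINARY_INV)
    # --psm 7 treats the image as a single text line; the whitelist restricts output to digits
    text = pytesseract.image_to_string(blackAndWhiteImage, config="--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789")
    print('Text: ', text)

    # Show the preprocessed image until a key is pressed
    cv2.imshow('Image result', blackAndWhiteImage)
    cv2.waitKey(0)
    cv2.destroyAllWindows()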
    

    And the desired result:

    result image
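
For repeated reads, the steps above can be folded into one function. This is only a minimal sketch, not the answer's exact code: the names build_config, keep_digits and read_number are hypothetical, Otsu's method (cv2.THRESH_OTSU) is swapped in for the fixed threshold of 127, and it assumes light digits on a dark background and a Tesseract executable that pytesseract can find (on PATH, or set through tesseract_cmd as above). The regex filter stands in for the character-by-character replace loop from the question.

```python
import re

DIGITS = "0123456789"


def build_config(whitelist: str = DIGITS, psm: int = 7, oem: int = 3) -> str:
    """Compose the Tesseract option string used in the answer above."""
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


def keep_digits(raw: str) -> str:
    """Drop everything that is not an ASCII digit (spaces, newlines, stray letters)."""
    return re.sub(r"[^0-9]", "", raw)


def read_number(path: str) -> str:
    """OCR a single line of digits from the image at `path` (hypothetical helper)."""
    # Third-party imports are kept local so the pure helpers above can be
    # used and tested without OpenCV or Tesseract installed.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(f"could not read image: {path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu picks the threshold from the histogram instead of a fixed 60 or 127.
    # THRESH_BINARY_INV assumes light digits on a dark background (an assumption
    # about the screenshot), so the result is black text on white.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    raw = pytesseract.image_to_string(binary, config=build_config())
    return keep_digits(raw)
```

Calling read_number on the original screenshot path would skip the intermediate "bis.png" file, because the thresholded array is passed to pytesseract directly. keep_digits is meant as a safety net: with the whitelist in place it should mostly be stripping trailing whitespace.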