Tags: python, ocr, tesseract, python-tesseract

Recognizing a captcha with OpenCV and Tesseract in Python: poor accuracy


I'm trying to convert a captcha image to text.

I don't think this captcha is very difficult.

I load the image and preprocess it with OpenCV to make it easier to recognize.

Here is an example.

Example Captcha

After OpenCV Captcha

import cv2
import pytesseract
from PIL import Image

# Convert to grayscale and binarize with Otsu's threshold
image = cv2.imread(filename)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imwrite('OPENCV.png', gray)

# Get Text From Image
text = pytesseract.image_to_string(Image.open('OPENCV.png'), lang='eng', config="-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 8")
print(text)

It's simple, but the result is 'PLLY2', whereas I want 'PLLVI2' or 'PLLV12'.

Is there any option, or another approach, that would give better accuracy?

I use the single-word mode, '--psm 8'. I tried to find a way to make Tesseract look for a fixed number of characters, but that seems to be impossible.

I would really appreciate even a hint. Thank you very much for reading this question.


Solution

  • You could slice the image into one piece per letter and recognize each piece with --psm 10 (single-character mode):

    import cv2
    import pytesseract

    image = cv2.imread(filename)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

    # One 25-pixel-wide slice per character; the last slice takes the remainder.
    pieces = [gray[:, :25], gray[:, 25:50], gray[:, 50:75],
              gray[:, 75:100], gray[:, 100:125], gray[:, 125:]]

    # --psm 10 treats each slice as a single character.
    config = '--psm 10 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # strip() removes the trailing whitespace Tesseract appends to each result.
    print(''.join(pytesseract.image_to_string(p, config=config).strip() for p in pieces))
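
The fixed 25-pixel slices above only fit an image of one particular width. As a rough sketch, the slice boundaries could instead be computed from the image width and the number of characters. This assumes the characters are evenly spaced, which may not hold for every captcha. The helper names (`column_bounds`, `split_columns`, `normalize_guess`) are made up for illustration and are not part of OpenCV or pytesseract. The last helper checks the joined result against the expected length, since Tesseract itself cannot be told to return a fixed number of characters.

```python
import string

# Same character set as the whitelist used in the question.
ALLOWED = set(string.digits + string.ascii_uppercase)


def column_bounds(width, n_chars):
    """Return (start, stop) column pairs splitting `width` into n_chars near-equal strips."""
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    # Integer division keeps the edges monotonic and ends exactly at `width`.
    edges = [i * width // n_chars for i in range(n_chars + 1)]
    return list(zip(edges[:-1], edges[1:]))


def split_columns(image, n_chars):
    """Slice a 2-D image array (rows, columns) into n_chars vertical strips.

    Assumption: evenly spaced characters; `image` is anything with a
    `.shape` attribute and 2-D slicing, such as the `gray` array above.
    """
    width = image.shape[1]
    return [image[:, start:stop] for start, stop in column_bounds(width, n_chars)]


def normalize_guess(raw, expected_len):
    """Drop characters outside ALLOWED; return None if the length is not expected_len."""
    cleaned = ''.join(ch for ch in raw if ch in ALLOWED)
    return cleaned if len(cleaned) == expected_len else None
```

With the `gray` array from the answer, `split_columns(gray, 6)` would replace the hand-written slices, and each strip can be passed to `pytesseract.image_to_string` with the same `--psm 10` config. Running the joined text through `normalize_guess(text, 6)` then gives either a six-character string or `None`, so the caller can decide whether to retry with different preprocessing.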