python tesseract python-tesseract opencv

tesseract detects only 4 words from image

I have some very simple Python code:

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Tesseract-OCR\\tesseract.exe'
img = cv2.imread('1.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

hImg,wImg,_ = img.shape

#detecting words
boxes = pytesseract.image_to_data(img)
for x,b in enumerate(boxes.splitlines()):
    if x!=0:
        b = b.split()
        if len(b) == 12:
            x,y,w,h = int(b[6]), int(b[7]), int(b[8]), int(b[9])
            cv2.rectangle(img, (x,y), (w+x,h+y), (0,0,255), 3)


cv2.imshow('result', img)
cv2.waitKey(0)

But the result was surprising: it detected only 4 words. What could be the reason?

Solution

You'll have better OCR results if you improve the quality of the image you are giving Tesseract.

While Tesseract 3.05 and older handle inverted images (light text on a dark background) without problems, for 4.x versions you should use dark text on a light background.
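If an image has light text on a dark background, inverting it gives the polarity that 4.x prefers. The following is only a minimal sketch: the helper name `ensure_dark_on_light` and the mean-brightness heuristic are assumptions for illustration, and it expects a single-channel 8-bit grayscale array.

```python
import numpy as np

def ensure_dark_on_light(gray):
    """Return gray with dark text on a light background (heuristic)."""
    # Assumption: the background covers most of the image, so a low mean
    # brightness suggests a dark background that should be inverted.
    if gray.mean() < 127:
        return 255 - gray  # same result as cv2.bitwise_not for uint8 images
    return gray

# Tiny synthetic example: mostly black "page" with a few white "text" pixels
page = np.zeros((4, 4), dtype=np.uint8)
page[1, 1:3] = 255
fixed = ensure_dark_on_light(page)
```

The heuristic can be fooled by images where text covers most of the area, so it may be safer to check the result visually with cv2.imshow.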

Convert the image from BGR to HLS so that the background colors can be removed from the numbers in the top half of the image. Then create a "blue" mask with cv2.inRange and replace everything that is not "blue" with white.

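import numpy as np

# img is assumed to be in BGR order, as returned by cv2.imread.
# OpenCV's HLS channel order is hue, lightness, saturation.
hls = cv2.cvtColor(img, cv2.COLOR_BGR2HLS)

# Define lower and upper limits for the number colors.
blue_lo = np.array([114, 70, 70])
blue_hi = np.array([154, 225, 225])

# Mask image to only select "blue"
mask = cv2.inRange(hls, blue_lo, blue_hi)

# Copy the original image and paint every non-"blue" pixel white
img1 = img.copy()
img1[mask == 0] = (255, 255, 255)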
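To see what the mask does, here is a small sketch on a synthetic three-pixel image. It uses plain NumPy in place of cv2.inRange (an assumption made only to keep the example self-contained), and the pixel values are made up. A pixel is kept only when all three HLS channels fall inside the limits.

```python
import numpy as np

blue_lo = np.array([114, 70, 70])
blue_hi = np.array([154, 225, 225])

# Three made-up HLS pixels: inside the range, hue too low, lightness too high
hls = np.array([[[120, 100, 100], [30, 100, 100], [120, 240, 100]]],
               dtype=np.uint8)

# Equivalent of cv2.inRange: 255 where every channel is within limits, else 0
in_range = np.all((hls >= blue_lo) & (hls <= blue_hi), axis=-1)
mask = np.where(in_range, 255, 0).astype(np.uint8)

# Stand-in for the original color image; non-"blue" pixels become white
img = np.full((1, 3, 3), 50, dtype=np.uint8)
img1 = img.copy()
img1[mask == 0] = (255, 255, 255)
```

If too much or too little of your image survives the mask, the limits in blue_lo and blue_hi are the values to adjust.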

Help pytesseract by converting the image to black and white

Binarization converts the image to pure black and white. Tesseract does this internally (using the Otsu algorithm), but the result can be suboptimal, particularly if the page background is unevenly dark.

# img1 is a copy of img, so it is still in BGR order (not HLS)
gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
# With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored
_, img1 = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imshow('img_to_binary', img1)
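For intuition about what Otsu's method computes, here is a rough NumPy sketch. It is not OpenCV's implementation, and the function name `otsu_threshold` is made up for this example. It picks the single global threshold that maximizes the between-class variance of the two pixel groups, which is also why an unevenly lit background can defeat it.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                 # pixel count with value <= t
    w1 = hist.sum() - w0                 # pixel count with value > t
    sum0 = np.cumsum(hist * levels)      # intensity sum for value <= t
    sum1 = sum0[-1] - sum0               # intensity sum for value > t
    valid = (w0 > 0) & (w1 > 0)          # both classes must be non-empty
    mu0 = sum0 / np.maximum(w0, 1)       # mean of the darker class
    mu1 = sum1 / np.maximum(w1, 1)       # mean of the brighter class
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, 0.0)
    return int(np.argmax(between))

# Synthetic grayscale image with two clearly separated intensity groups
gray = np.array([[50, 50, 200], [50, 200, 200]], dtype=np.uint8)
t = otsu_threshold(gray)
binary = np.where(gray > t, 255, 0).astype(np.uint8)
```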

Run image_to_data on the previously created img1 and continue with your existing code.

...
hImg,wImg,_ = img.shape

#detecting words
boxes = pytesseract.image_to_data(img1)
for x,b in enumerate(boxes.splitlines()):
    ...
...
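The loop above splits each row on whitespace and keeps only rows with exactly 12 fields. Since image_to_data returns tab-separated text with a header row by default, a slightly more explicit option is to split on tabs and look fields up by column name. This is only a sketch: the helper name `parse_word_boxes` and the `min_conf` cutoff are hypothetical, and the sample string is made up to mimic the output format.

```python
def parse_word_boxes(tsv, min_conf=0.0):
    """Yield (left, top, width, height, text) for each recognized word."""
    lines = tsv.splitlines()
    header = lines[0].split('\t')
    for line in lines[1:]:
        row = dict(zip(header, line.split('\t')))
        text = row.get('text', '').strip()
        # Rows without text (pages, blocks, lines) or with low confidence
        # are skipped; structural rows typically carry a conf of -1.
        if not text or float(row['conf']) < min_conf:
            continue
        yield (int(row['left']), int(row['top']),
               int(row['width']), int(row['height']), text)

# Made-up sample in the same column layout as image_to_data's output
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t30\t15\t96.5\tHello\n"
)
boxes = list(parse_word_boxes(sample))
```

With real output you would pass the string returned by pytesseract.image_to_data(img1) instead of sample, then draw each box with cv2.rectangle as in the question.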