python python-3.x opencv tesseract python-tesseract

How to set up Tesseract OCR properly

I am using Tesseract OCR to convert a preprocessed license plate image into text, but I have not had much success with some images that look perfectly fine. The Tesseract setup can be seen in the function definition below. I am running this on Google Colab. The input image is the ZG NIVEA 1 plate below. I am not sure whether I am doing something wrong or whether there is a better way to do this; the result I get from this particular image is A.

!sudo apt install -q tesseract-ocr
!pip install -q pytesseract
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
import cv2
import re

def pytesseract_image_to_string(img, oem=3, psm=7) -> str:
  '''
  oem - OCR Engine Mode
      0 = Original Tesseract only.
      1 = Neural nets LSTM only.
      2 = Tesseract + LSTM.
      3 = Default, based on what is available.
  psm - Page Segmentation Mode
      0 = Orientation and script detection (OSD) only.
      1 = Automatic page segmentation with OSD.
      2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
      3 = Fully automatic page segmentation, but no OSD. (Default)
      4 = Assume a single column of text of variable sizes.
      5 = Assume a single uniform block of vertically aligned text.
      6 = Assume a single uniform block of text.
      7 = Treat the image as a single text line.
      8 = Treat the image as a single word.
      9 = Treat the image as a single word in a circle.
      10 = Treat the image as a single character.
      11 = Sparse text. Find as much text as possible in no particular order.
      12 = Sparse text with OSD.
      13 = Raw line. Treat the image as a single text line,
          bypassing hacks that are Tesseract-specific.
  '''
  tess_string = pytesseract.image_to_string(img, config=f'--oem {oem} --psm {psm}')
  regex_result = re.findall(r'[A-Z0-9]', tess_string) # filter only uppercase alphanumeric symbols
  return ''.join(regex_result)

image = cv2.imread('nivea.png')
print(pytesseract_image_to_string(image))

Edit: The approach in the accepted answer works for the ZGNIVEA1 image, but not for others. Is there a general "font size" that Tesseract OCR works best with, or is there a rule of thumb?

Solution

By upscaling the image and applying a Gaussian blur before OCR, I got the correct output. Also, you may not need the regex if you add -c tessedit_char_whitelist=ABC.. to your config string.

The code that produces correct output for me:

import cv2
import pytesseract

image = cv2.imread("images/tesseract.png")

config = '--oem 3 --psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ'

image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
image = cv2.GaussianBlur(image, (5, 5), 0)

string = pytesseract.image_to_string(image, config=config)

print(string)

Output:

Answer 2:

Sorry for the late reply. I tested the same code on your second image, and it gave me the correct output. Are you sure you removed the whitelist part of the config? The whitelist in my code does not allow digits.
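If you would rather keep a whitelist than remove it, here is a minimal sketch of a config that allows digits as well as uppercase letters. The names `PLATE_CHARS`, `build_config`, and `keep_plate_chars` are hypothetical helpers for illustration, not part of pytesseract. As far as I know, whitelist handling under the LSTM engine has varied between Tesseract 4.x releases, so the regex filter from the question is kept as a fallback.

```python
import re
import string

# Assumption: plates contain only uppercase Latin letters and digits.
PLATE_CHARS = string.ascii_uppercase + string.digits


def build_config(oem: int = 3, psm: int = 7, whitelist: str = PLATE_CHARS) -> str:
    """Build a Tesseract config string that restricts output to `whitelist`."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def keep_plate_chars(text: str) -> str:
    """Fallback filter: drop everything except uppercase letters and digits."""
    return "".join(re.findall(r"[A-Z0-9]", text))
```

The result of `build_config()` can be passed as the `config` argument of `pytesseract.image_to_string`, and `keep_plate_chars` can be applied to the returned string to strip whitespace and stray symbols.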

The most accurate solution here is to train your own Tesseract model on the license plate font (FE-Schrift) instead of using Tesseract's default eng.traineddata model. This should increase the accuracy, since such a model has only the characters from your use case as output classes. As for your latter question, Tesseract does some preprocessing before the recognition step (thresholding, morphological closing, etc.), which is why it is so sensitive to letter size. In a smaller image the character contours are closer to each other, so they are not kept apart after closing.
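Because letter size matters, one option is to rescale every plate so that the characters end up at roughly the same pixel height before OCR. Below is a minimal sketch of computing that scale factor. The helper name `scale_for_text_height` and the default target of 32 pixels are assumptions for illustration, not documented Tesseract values; tune the target on your own images.

```python
def scale_for_text_height(current_height_px: float,
                          target_height_px: float = 32.0) -> float:
    """Return the resize factor that brings text to the target pixel height.

    `target_height_px` is a placeholder assumption, not a Tesseract constant.
    """
    if current_height_px <= 0 or target_height_px <= 0:
        raise ValueError("heights must be positive")
    return target_height_px / current_height_px
```

The returned factor can be used for both `fx` and `fy` in `cv2.resize`, in place of the fixed factor of 2 used in the accepted answer. The character height itself has to be measured first, for example from the bounding boxes of the character contours.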

To train Tesseract on a custom font, you can follow the official docs.

To read more about the theoretical side of Tesseract, you can check these papers: 1 (relatively old), 2 (newer)