Tags: python, ocr, python-tesseract

Python PyTesseract Accuracy Improvement


I know this has been asked before, and I have tried several different approaches, but I cannot get this to work. I have a bunch of pages where the OCR works perfectly: the text is clear and neatly laid out. But on one of the sheets it reads completely wrong values. My code, the output, and the image are below.

import pytesseract
import cv2

# Load the scanned page
img = cv2.imread('page_3.jpg')

# Upscale by 2x in both directions
img = cv2.resize(img, None, fx=2, fy=2)

# Convert to grayscale
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Save the pre-processed image for inspection
cv2.imwrite('thresh.png', img)

# Try page segmentation modes 6 through 13
for psm in range(6, 13 + 1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config=config, lang='eng')
    print('psm ', psm, ':', txt)

Here is the photo: [image]

And here is the output. It is correct until the end, where the readings go wrong. The outputs for psm 6, 11, and 12 are all identical. Any help is appreciated.

1885-1015

1886-1280

1956-0044

2087-0047

2087-0155

2087-1433

2221-0093L

2221-0093R

2331-4628R

2992-/114R

29593-0007R
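To pin down which lines were misread, a small format check can help. This is only a sketch: it assumes every entry is four digits, a hyphen, four digits, and an optional L or R suffix, which is inferred from the lines that look correct above and is not something the sheet itself confirms. The names `PART_NUMBER` and `find_suspect_lines` are made up for illustration.

```python
import re
from typing import List

# Assumed format (inferred from the good lines, not confirmed):
# four digits, a hyphen, four digits, then an optional L or R.
PART_NUMBER = re.compile(r"\d{4}-\d{4}[LR]?")


def find_suspect_lines(ocr_text: str) -> List[str]:
    """Return the non-empty OCR lines that do not match the assumed format."""
    stripped = (line.strip() for line in ocr_text.splitlines())
    return [line for line in stripped if line and not PART_NUMBER.fullmatch(line)]


# The last lines of the output shown above
sample = "2221-0093R\n\n2331-4628R\n\n2992-/114R\n\n29593-0007R\n"
print(find_suspect_lines(sample))
```

Under that assumption, the check flags only the last two lines of the output, which matches the point where the readings go wrong.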


Solution

  • Your image does not require any pre-processing. It is already clean and well structured, so do not resize it before passing it to Tesseract. Hope this helps.
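A minimal sketch of that suggestion, assuming the same `page_3.jpg` file and the `--oem 3` setting from the question. The helper names `build_config` and `ocr_without_resizing` are hypothetical, and the only conversion applied is the BGR-to-RGB channel reorder, since OpenCV loads images as BGR:

```python
def build_config(psm: int = 6, oem: int = 3) -> str:
    """Build the Tesseract config string used in the question."""
    return "--oem %d --psm %d" % (oem, psm)


def ocr_without_resizing(path: str, psm: int = 6) -> str:
    """Run Tesseract on the image at its original size."""
    # Imported here so build_config stays usable without OpenCV installed.
    import cv2
    import pytesseract

    img = cv2.imread(path)
    if img is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError(path)

    # Channel reorder only: no resizing, thresholding, or grayscale step.
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, config=build_config(psm), lang="eng")
```

Calling `print(ocr_without_resizing('page_3.jpg'))` would then run the OCR on the untouched page. Whether psm 6 is the best mode for this sheet is not something the answer states, so it may be worth comparing a few modes as in the question.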