opencv ocr tesseract preprocessor

Steps to improve pre-processing for OCR


I spent half a day trying to find the best way to pre-process images for Tesseract OCR and did not get any good results beyond thresholding. Can anybody suggest what steps I should try? OpenCV, ImageMagick, or GIMP are all fine for me as tools. The images can have different backgrounds, but the font and the font color are always the same. Here are the image samples:

  1. Image 1
  2. Image 2
  3. Image 3

This is what I currently get using threshold filters: (image)

and the OCR output looks like this: "ELIMINATED LIFELINES220_{¢-\"| “, Vv a . —"


Solution

  • I've found a good article that describes a lot of pre-processing steps: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

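    But the best one was the top-hat morphological operation, which subtracts the morphological opening of the image from the image itself, so each pixel is adjusted based on its neighborhood and only bright details smaller than the kernel remain. That can be done using OpenCV: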
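    rectKernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))  # size is a guess; tune it to the stroke width
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, rectKernel)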

    It can also be done using ImageMagick: http://www.imagemagick.org/Usage/morphology/#top-hat