python ocr python-tesseract

Python Text Extraction Tesseract


I am trying to extract text from an image using Python and Tesseract (pytesseract), but every attempt so far has failed. Why is Tesseract unable to extract the text? Here is the image: [Image]

Code

import cv2
import pytesseract as pt
inp = "./image.jpg"
img = cv2.imread(inp)
print(pt.image_to_string(img))

Version

tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found SSE

Solution

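  • You could preprocess the image with OpenCV before passing it to Tesseract. Loading it as grayscale and applying an inverted Otsu threshold fixes the problem here: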

    import cv2  # pip install opencv-python
    import pytesseract  # pip install pytesseract

    # Load the image as grayscale (change the path to your file)
    image = cv2.imread("test.jpg", cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError("could not read test.jpg")

    # Binarize with Otsu's method; THRESH_BINARY_INV also swaps dark and light
    thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # --psm 6 assumes a single uniform block of text;
    # the whitelist restricts the output to the characters 0123456789:
    print(pytesseract.image_to_string(thresh, lang='eng',
               config='--psm 6 -c tessedit_char_whitelist=0123456789:'))
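
To make the thresholding step less of a black box, here is a rough pure-Python sketch of what Otsu's method computes: the gray level that maximizes the between-class variance of the histogram. It is an illustration only, not OpenCV's implementation. `otsu_threshold` and `binarize_inv` are hypothetical helper names, and the sample pixel values are made up.

```python
def otsu_threshold(pixels):
    """Return the gray level t that best splits 8-bit pixels into <= t and > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the lower class yet
        if w1 == 0:
            break     # no pixels left in the upper class
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance (unnormalized)
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize_inv(pixels, t):
    """Inverted binary threshold: levels above t become 0, the rest become 255."""
    return [0 if p > t else 255 for p in pixels]


# Made-up sample: a dark cluster (20-30) and a bright cluster (200-220)
sample = [20] * 50 + [30] * 50 + [200] * 100 + [220] * 40
t = otsu_threshold(sample)
mask = binarize_inv(sample, t)
```

The threshold lands between the two clusters, so the dark pixels end up white in the inverted mask and the bright ones black. In practice `cv2.threshold` does this for the whole image in one call.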
    

    Output:

    05:26:34
    09:04:24
    01:00:31
    01:14:36
    01:17:43
    02:31:05
    02:35:41
    05:32:42
    03:26:09
    02:44:11
    02:56:00
    02:32:42
    02:35:16
    07:16:10
    07:18:36
    07:19:00
    07:19:32
    07:21:17
    07:21:48
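
Since the whitelist limits the result to digits and colons, the recognized text can be sanity-checked cheaply. The sketch below assumes every line should look like HH:MM:SS, as in the output above; `split_valid_lines` is a hypothetical helper, and the malformed third line in the sample is invented to show the check.

```python
import re

# Assumed layout: two digits, colon, two digits, colon, two digits
TIME_RE = re.compile(r"\d{2}:\d{2}:\d{2}")


def split_valid_lines(ocr_text):
    """Split OCR output into lines that match HH:MM:SS and lines that do not."""
    valid, suspect = [], []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if TIME_RE.fullmatch(line):
            valid.append(line)
        else:
            suspect.append(line)
    return valid, suspect


# The third line is deliberately malformed for the demonstration
valid, suspect = split_valid_lines("05:26:34\n09:04:24\n01:0031\n\n")
print(valid, suspect)
```

Lines that land in `suspect` are good candidates for a second look at the preprocessing or the page segmentation mode.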
    

    Keep in mind that Tesseract itself must also be installed and on your PATH.
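
One way to check this from Python is to look the executable up before calling pytesseract. This is a small sketch using only the standard library; `find_tesseract` is a hypothetical helper, and the commented lines show how the result could be handed to pytesseract if it is installed.

```python
import shutil


def find_tesseract(extra_candidates=()):
    """Return a path to the tesseract executable, or None if none is found."""
    found = shutil.which("tesseract")
    if found:
        return found
    # Fall back to explicit paths supplied by the caller
    for candidate in extra_candidates:
        resolved = shutil.which(candidate)
        if resolved:
            return resolved
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("tesseract not found on PATH; install it or pass its full path")
else:
    print("using", tesseract_path)
    # With pytesseract installed, the binary can be set explicitly:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```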

    If you get a lot of garbage output, or Tesseract cannot find the language "eng", there is an easy fix. On Linux, cd into /usr/local/share/tessdata or /usr/share/tessdata and run:

    sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
    

    That downloads the English language file and should fix the problem.
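
Before downloading anything, it can help to confirm whether the language file is really missing. The sketch below looks for `eng.traineddata` in the two directories mentioned above, plus `TESSDATA_PREFIX` if that environment variable is set (assumed here to point at the tessdata directory itself, as in Tesseract 4). `find_traineddata` is a hypothetical helper, and your installation may keep its data elsewhere.

```python
import os
from pathlib import Path

# The two Linux locations named in the answer; other systems may differ
DEFAULT_DIRS = ("/usr/local/share/tessdata", "/usr/share/tessdata")


def find_traineddata(lang="eng", search_dirs=DEFAULT_DIRS):
    """Return the path of <lang>.traineddata, or None if it is not found."""
    dirs = list(search_dirs)
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        dirs.insert(0, prefix)  # check the explicitly configured location first
    for d in dirs:
        candidate = Path(d) / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None


print(find_traineddata("eng") or "eng.traineddata not found in the usual places")
```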

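    Tesseract version used here (the character whitelist is ignored by the LSTM engine in Tesseract 4.0 and was restored in 4.1, so the 4.0.0-beta.1 build from the question may not honor it):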

    >> tesseract --version
    tesseract 4.1.1
     leptonica-1.81.0
      libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.0
     Found AVX2
     Found AVX
     Found FMA
     Found SSE
     Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.5