Tags: python, computer-vision, python-imaging-library, tesseract, python-tesseract

How to make pytesseract work in Google Colab (Python)?


I tried several sets of steps I found while researching, but none of them let me run the pytesseract code.

Downloaded tesseract exe from https://github.com/UB-Mannheim/tesseract/wiki.

Installed this exe in C:\Program Files\Tesseract-OCR

Installed pytesseract using pip

Imported pytesseract and ran:

pytesseract.pytesseract.tesseract_cmd = r'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    
a = pytesseract.image_to_string(PIL.Image.open('/content/drive/MyDrive/hindi_image.jpg'),lang='hin')

but these steps throw an error:

FileNotFoundError                         Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
        253     try:
    --> 254         proc = subprocess.Popen(cmd_args, **subprocess_args())
        255     except OSError as e:
    
    6 frames
FileNotFoundError: [Errno 2] No such file or directory: 'tesseract': 'tesseract'
    
During handling of the above exception, another exception occurred:
    
TesseractNotFoundError                    Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
        256         if e.errno != ENOENT:
        257             raise e
    --> 258         raise TesseractNotFoundError()
        259 
        260     with timeout_manager(proc, timeout) as error_string:
    
TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

On my local system the path is the same as above.

How can I resolve this? Please help. Thank you!


Solution

  • Google Colab runs on a Linux server, so you can't use a path from your local Windows machine.

    You have to install Tesseract for Linux on the server using

    !apt install tesseract-ocr
    

    and use the path to this version.


    But you probably won't need to set this path in code at all, because apt installs tesseract into a folder that is already on the PATH environment variable, so pytesseract should find it without an explicit path.

    I run pytesseract on local Linux and I don't have to set the path in code.
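
    To check what the Colab runtime actually sees, something like the sketch below may help. It uses only the standard library; the helper names `locate_tesseract` and `describe_runtime` are made up for this illustration and are not part of pytesseract.

```python
import platform
import shutil


def locate_tesseract(binary_name="tesseract"):
    """Return the full path of the binary found on PATH, or None if it is missing."""
    return shutil.which(binary_name)


def describe_runtime():
    """Collect the OS name and the tesseract location (hypothetical helper)."""
    return {
        "system": platform.system(),      # "Linux" on a Linux server, "Windows" locally
        "tesseract": locate_tesseract(),  # None means tesseract is not on PATH
    }
```

    Calling `print(describe_runtime())` in a Colab cell before and after the apt install should show the `tesseract` entry change from `None` to a Linux path. Only if the binary sits somewhere that is not on PATH would you still need to assign that path to `pytesseract.pytesseract.tesseract_cmd`.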


    If you need a language other than English, you can list all available language packages with

    !apt search tesseract
    

    and install the one you need (e.g. Hindi)

    !apt install tesseract-ocr-hin
    

    You may also need to pass the option lang='hin' to pytesseract to use this language.
    To use both languages you can try lang='hin+eng'
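
    Before passing a value such as 'hin+eng', it can be useful to confirm that every language in it is really installed. The sketch below asks the tesseract binary itself through its `--list-langs` option. The helper names are hypothetical, and the header-line handling assumes the listing starts with a "List of available languages" line.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Turn the text printed by `tesseract --list-langs` into a list of codes."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages ..." header line.
    return [line for line in lines if line and not line.startswith("List of")]


def installed_languages():
    """Ask tesseract which languages it has; empty list if it is not installed."""
    if shutil.which("tesseract") is None:
        return []
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Assumption: some versions print the list to stderr instead of stdout,
    # so fall back to stderr when stdout is empty.
    return parse_list_langs(proc.stdout or proc.stderr)


def build_lang(*codes):
    """Join language codes into the 'a+b' form used by the lang option."""
    return "+".join(codes)


def missing_languages(lang, available):
    """Return the codes from an 'a+b' string that are not in `available`."""
    return [code for code in lang.split("+") if code not in available]
```

    For example, `missing_languages(build_lang('hin', 'eng'), installed_languages())` returns the codes that still need an `!apt install tesseract-ocr-...` step.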


    EDIT:

    I tested on Google Colab: after running !apt install tesseract-ocr I can use pytesseract without setting the path.


    EDIT:

    pytesseract writes the image to a file and runs tesseract with the path to this file. tesseract writes its result to a text file, and pytesseract then reads the result from that text file.

    But you can pass the image path directly and read the result from the text file yourself.

    import pytesseract
    
    pytesseract.pytesseract.run_tesseract('path/to/image.png', 'output', 'txt', lang='hin')
    
    # read the result as UTF-8 so non-ASCII text (e.g. Hindi) decodes correctly
    with open('output.txt', encoding='utf-8') as fh:
        result = fh.read()
    
    print(result)
    

    or even wrapped in a function

    import pytesseract
    
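    def file_to_text(filename, *args, **kwargs):
        # run tesseract directly on the image file; it writes output.txt
        pytesseract.pytesseract.run_tesseract(filename, 'output', 'txt', *args, **kwargs)
        # read the result as UTF-8 so non-ASCII text (e.g. Hindi) decodes correctly
        with open('output.txt', encoding='utf-8') as fh:
            return fh.read()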
    
    # ---
    
    text = file_to_text('path/to/image.png', lang='hin')
    
    print(text.strip())
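
    The versions above leave output.txt in the working directory. A possible variation, sketched here, writes the result into a temporary directory that is removed afterwards. The function name `file_to_text_tmp` and its `runner` parameter are hypothetical, added only so that the tesseract call can be swapped out. By default it uses the same `run_tesseract` call shown above.

```python
import os
import tempfile


def file_to_text_tmp(filename, lang="eng", runner=None):
    """OCR an image file and return the text without leaving output files behind."""
    if runner is None:
        # Imported lazily so the module loads even where pytesseract is absent.
        import pytesseract
        runner = pytesseract.pytesseract.run_tesseract

    with tempfile.TemporaryDirectory() as tmpdir:
        # tesseract appends the extension itself, so pass the base name only.
        base = os.path.join(tmpdir, "output")
        runner(filename, base, "txt", lang)
        # Read the result as UTF-8 so non-ASCII text (e.g. Hindi) decodes correctly.
        with open(base + ".txt", encoding="utf-8") as fh:
            return fh.read()
```

    Usage would be `text = file_to_text_tmp('path/to/image.png', lang='hin')`, after which nothing is left on disk.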