
How to define tesseract_cmd to use Tesseract-OCR in AWS Lambda Functions


I am using AWS to process images and extract text with Tesseract and Python. In my backend, I uploaded the pytesseract library and the Tesseract-OCR folder. Locally it works very well; I don't even need to change tesseract_cmd to find tesseract.exe. When I upload this folder to AWS Lambda, it raises a TesseractNotFound error saying that tesseract is not installed or it's not in your PATH. I already tried changing tesseract_cmd, but I could not solve it. My folder structure is /opt/python/lib/python3.7/site-packages, and inside site-packages I have my libraries (Pillow, pytesseract, Tesseract-OCR). I already tried creating a new Lambda function using this and this option, but neither worked. I think I can solve it using environment variables, but I have no idea how to do it.

error

my folder structure

If someone knows a better way to do this that works, I will accept it as an answer too.


Solution

  • To solve this error I had to do quite a few things, but in the end it works. As was commented, AWS Lambda runs in a Linux environment, so you need to compile the libraries for Linux. I don't have a Linux machine, so I followed these steps:

    You can skip step 1 by downloading the files here

    1 - (If you don't have a Linux machine) I started an EC2 instance with the Amazon Linux AMI; the basic instance works fine.

    sudo yum update
    sudo yum install git-core -y
    sudo yum install docker -y
    sudo service docker start
    sudo usermod -a -G docker ec2-user #It will allow ec2-user to call docker
    

    After running the last command, you need to log back in to your EC2 instance (just disconnect and reconnect) so that the new docker group membership takes effect.

    git clone https://github.com/amtam0/lambda-tesseract-api.git
    cd lambda-tesseract-api/
    bash build_tesseract4.sh #It will take some time
    bash build_py37_pkgs.sh
    

    After that, you will have one folder (lambda-tesseract-api) with the zipped files that you need. In my case, I created a GitHub repository, uploaded all the files there, and then downloaded them to my computer to create my Lambda layers.

    2 - After downloading the files, upload the zip files to your layers one by one (open-cv, Pillow, tesseract, pytesseract) and then attach the layers to your Lambda function to run tesseract.
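
    Before uploading, it can help to look at what sits at the top level of each zip, since Lambda extracts layer contents under /opt and the paths inside the zip decide where things land. This is only a sketch using the standard library; the function name top_level_entries is made up for illustration and is not part of the repository above.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names inside a zip archive.

    Hypothetical helper for a quick sanity check: an entry such as
    "python/lib/python3.7/site-packages/..." would end up under
    /opt/python/... once the layer is attached to a function.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist() if name})
```

    Calling top_level_entries on each layer zip shows whether the packages sit at the root of the archive or under a python/ folder, which matters when choosing the PYTHONPATH value.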

    This is the lambda_handler function you will create to make tesseract work. (oem, psm and lang are tesseract parameters, and you can learn more here)

    import base64
    import pytesseract

    def ocr(img, oem=None, psm=None, lang=None):
        # Only pass the options that were provided; formatting None into the
        # string would hand tesseract a literal "--oem None".
        parts = []
        if oem is not None:
            parts.append('--oem {}'.format(oem))
        if psm is not None:
            parts.append('--psm {}'.format(psm))
        if lang is not None:
            parts.append('-l {}'.format(lang))
        return pytesseract.image_to_string(img, config=' '.join(parts))

    def lambda_handler(event, context):
        # Extract content from the JSON body
        body_image64 = event['image64']
        params = event.get('tess-params', {})

        # Decode and save the input image to /tmp (the writable path on Lambda)
        with open("/tmp/saved_img.png", "wb") as f:
            f.write(base64.b64decode(body_image64))

        # Run OCR on the saved file
        ocr_text = ocr("/tmp/saved_img.png", oem=params.get("oem"),
                       psm=params.get("psm"), lang=params.get("lang"))

        # Return the result data in JSON format
        return {
            "ocr": ocr_text,
        }
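
    The handler above expects an event with a base64-encoded image under image64 and the tesseract options under tess-params. A minimal sketch of building such a test event follows; build_event is a hypothetical helper, and the default values (oem 1, psm 3, lang "eng") are only illustrative choices, not something the original answer prescribes.

```python
import base64


def build_event(image_bytes, oem=1, psm=3, lang="eng"):
    """Build a test event in the shape the handler reads.

    The defaults are assumptions for illustration; pass whatever
    tesseract options fit your images.
    """
    return {
        # base64 text is JSON-safe, unlike raw image bytes
        "image64": base64.b64encode(image_bytes).decode("ascii"),
        "tess-params": {"oem": oem, "psm": psm, "lang": lang},
    }
```

    For example, reading a PNG with open(path, "rb").read() and passing the bytes to build_event gives a dict that can be serialized with json.dumps and pasted into the Lambda console as a test event.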
    

    You will also need to set one environment variable: the key is PYTHONPATH and the value is /opt/
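
    If the TesseractNotFound error still appears, one way to narrow it down is to check where (or whether) the binary is visible from inside the function. The sketch below uses only the standard library; find_binary is a hypothetical helper, and /opt/bin/tesseract is an assumed location that depends on how your tesseract layer zip is laid out.

```python
import os
import shutil


def find_binary(name="tesseract", candidates=("/opt/bin/tesseract",)):
    """Return a usable path to `name`, or None if nothing is found.

    The candidate path is an assumption about the layer layout;
    adjust it to match the contents of your own layer.
    """
    # First try the normal PATH lookup
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # Fall back to explicit locations, requiring an executable file
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

    If find_binary() returns a path, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling image_to_string; if it returns None, the binary is most likely not in the layer at all or is not marked executable.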

    Reference:

    https://medium.com/analytics-vidhya/build-tesseract-serverless-api-using-aws-lambda-and-docker-in-minutes-dd97a79b589b

    Tesseract OCR on AWS Lambda via virtualenv (Alex Albracht Answer)