Tags: amazon-web-services, amazon-elastic-beanstalk, tesseract, epel

Fastest way to install Tesseract on Elastic Beanstalk


I am currently using Tika to extract text from files uploaded to my Rails app running on AWS Elastic Beanstalk (64bit Amazon Linux 2016.03 v2.1.2 running Ruby 2.2). I'd like to index scanned images as well, so I need to install Tesseract.

I got it working by building it from source, as shown below, but that added 10 minutes to every deploy to a fresh instance. Is there a faster way to do this?

.ebextensions/02-tesseract.config

packages:
  yum:
    autoconf: []
    automake: []
    libtool: []
    libpng-devel: []
    libtiff-devel: []
    zlib-devel: []

container_commands:
  01-command:
    command: mkdir -p install
    cwd: /home/ec2-user
  02-command:
    command: cp .ebextensions/scripts/install_tesseract.sh /home/ec2-user/install/
  03-command:
    command: bash install/install_tesseract.sh
    cwd: /home/ec2-user

.ebextensions/scripts/install_tesseract.sh

#!/usr/bin/env bash

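# Abort if a directory is missing so later steps never run in the wrong place
cd_to_install () {
  cd /home/ec2-user/install || exit 1
}

cd_to () {
  cd "/home/ec2-user/install/$1" || exit 1
}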

if ! [ -x "$(command -v tesseract)" ]; then
  # Add `/usr/local/bin` to PATH
  echo 'pathmunge /usr/local/bin' > /etc/profile.d/usr_local.sh
  chmod +x /etc/profile.d/usr_local.sh

  # Install leptonica
  cd_to_install
  wget http://www.leptonica.org/source/leptonica-1.73.tar.gz
  tar -zxvf leptonica-1.73.tar.gz
  cd_to leptonica-1.73
  ./configure
  make
  make install
  rm -rf /home/ec2-user/install/leptonica-1.73.tar.gz
  rm -rf /home/ec2-user/install/leptonica-1.73

  # Install tesseract ~ the jewel of Odin's treasure room
  cd_to_install
  wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
  tar -zxvf 3.04.01.tar.gz
  cd_to tesseract-3.04.01
  ./autogen.sh
  ./configure
  make
  make install
  ldconfig
  rm -rf /home/ec2-user/install/3.04.01.tar.gz
  rm -rf /home/ec2-user/install/tesseract-3.04.01

  # Install tessdata
  cd_to_install
  wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
  tar -zxvf 3.04.00.tar.gz
  cp /home/ec2-user/install/tessdata-3.04.00/eng.* /usr/local/share/tessdata/
  rm -rf /home/ec2-user/install/3.04.00.tar.gz
  rm -rf /home/ec2-user/install/tessdata-3.04.00
fi

Solution

    Short answer

    .ebextensions/02-tesseract.config

    commands:
      01-libwebp:
        command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
      02-tesseract:
        command: "yum --enablerepo=epel -y install tesseract"
    

    Long answer

    I'm not familiar with non-Ubuntu package managers or ebextensions, so this took some digging. It turns out the stable EPEL repo has precompiled Tesseract binaries that can be installed on Amazon Linux.

    The first obstacle was figuring out how to use the EPEL repo. The easiest way is to use the enablerepo option on the yum command.

    That gets us here:

    yum --enablerepo=epel install tesseract
    

    Next, I had to resolve this dependency error:

    [root@ip-10-0-1-193 ec2-user]# yum install --enablerepo=epel tesseract
    Loaded plugins: priorities, update-motd, upgrade-helper
    951 packages excluded due to repository priority protections
    Resolving Dependencies
    --> Running transaction check
    ---> Package tesseract.x86_64 0:3.04.00-3.el6 will be installed
    --> Processing Dependency: liblept.so.4()(64bit) for package: tesseract-3.04.00-3.el6.x86_64
    --> Running transaction check
    ---> Package leptonica.x86_64 0:1.72-2.el6 will be installed
    --> Processing Dependency: libwebp.so.5()(64bit) for package: leptonica-1.72-2.el6.x86_64
    --> Finished Dependency Resolution
    Error: Package: leptonica-1.72-2.el6.x86_64 (epel)
               Requires: libwebp.so.5()(64bit)
     You could try using --skip-broken to work around the problem
     You could try running: rpm -Va --nofiles --nodigest
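
When the dependency chain is longer than this one, it can help to pull just the unmet capabilities out of the yum output. The following is a minimal sketch, not part of the original fix: `missing_requires` is a hypothetical helper name, and the sample input is copied from the error above.

```shell
# missing_requires (hypothetical helper): reads yum output on stdin and
# prints the capability named on each "Requires:" line.
missing_requires() {
  sed -n 's/^[[:space:]]*Requires:[[:space:]]*//p'
}

# Sample lines taken from the error output above.
printf '%s\n' \
  'Error: Package: leptonica-1.72-2.el6.x86_64 (epel)' \
  '           Requires: libwebp.so.5()(64bit)' |
  missing_requires
```

Each printed capability can then be looked up with `yum provides` (for example with `--enablerepo=epel` added) to see which package and repo supply it.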
    

    I found the solution here:

    "Just adding the epel repo doesn't solve it, as the packages in the amzn-main repository seem to overrule those in the epel repository. If the libwebp package in the amzn-main repo are excluded it should work."

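    Disabling amzn-main for that one package lets yum take libwebp from EPEL. The Tesseract install still has other dependencies that are found in the amzn-main repo, so that repo has to stay enabled for the second command. This is why I install libwebp first, in a separate step, with --disablerepo=amzn-main.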

    yum --enablerepo=epel --disablerepo=amzn-main install libwebp
    yum --enablerepo=epel install tesseract
    

    Finally, the packages key in an ebextensions config only takes package names and versions, so to pass options to yum on Elastic Beanstalk, run it from the commands key instead:

    .ebextensions/02-tesseract.config

    commands:
      01-libwebp:
        command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
      02-tesseract:
        command: "yum --enablerepo=epel -y install tesseract"
    

    Fortunately, this is also the easiest way to install Tesseract on Elastic Beanstalk!
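
To confirm the install after a deploy, a quick check over SSH might look like the sketch below. It reuses the `command -v` guard from the question's script; `is_installed` is a hypothetical helper name, and nothing here is required by Elastic Beanstalk itself.

```shell
# is_installed (hypothetical helper): succeeds if the named command is on PATH.
is_installed() {
  command -v "$1" >/dev/null 2>&1
}

if is_installed tesseract; then
  # Print the installed version and the language data tesseract can see
  tesseract --version
  tesseract --list-langs
else
  echo "tesseract not found on PATH" >&2
fi
```

If the language list is missing a language you need, that points at the tessdata files rather than at the binary.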