I am building a Docker image based on Alpine that depends on Tesseract for OCR. The Tesseract site lists two flavors of English: eng (modern English) and enm (Middle English). However, I am having trouble getting the eng version installed on Alpine.
My Dockerfile has the following:
FROM eclipse-temurin:17-jre-alpine as tesseract-master
RUN apk update && apk add tesseract-ocr
RUN apk update && apk add tesseract-ocr-data-eng
This fails to find the eng language package. During the build, the repository is listed, and it clearly does not contain the eng package.
I am able to install the enm package, but I expect problems since it is meant for Middle English.
Has anyone had success installing the eng package on Alpine?
If you look at the contents of one of those language packages, for example tesseract-ocr-data-enm, you will quickly realise it contains only one file:
Source: https://pkgs.alpinelinux.org/contents?name=tesseract-ocr-data-enm&branch=v3.17&arch=aarch64
Now, working backwards, you can search for the package that contains the file /usr/share/tessdata/eng.traineddata. Unsurprisingly, it is the default package: tesseract-ocr.
Source: https://pkgs.alpinelinux.org/contents?file=eng.traineddata&branch=v3.17&arch=aarch64
So, your Dockerfile should simply be:
FROM eclipse-temurin:17-jre-alpine as tesseract-master
RUN apk add --no-cache \
    tesseract-ocr
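
If you want the build itself to confirm that the English data is really there, you could add a check step. This is only a sketch: it assumes the package layout described above (Alpine v3.17, where eng.traineddata ships in tesseract-ocr) and that the tesseract binary is on the PATH after installation. The grep step is my own addition, not something the package requires.

```dockerfile
FROM eclipse-temurin:17-jre-alpine AS tesseract-master
RUN apk add --no-cache \
    tesseract-ocr
# Assumed sanity check: list the languages tesseract can find and
# fail the build if there is no line that is exactly "eng".
# 2>&1 is there in case the list is written to stderr.
RUN tesseract --list-langs 2>&1 | grep -x eng
```

If the base image is on an Alpine branch that packages the language data differently, this step fails at build time instead of surfacing later as an OCR error at runtime.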