We're working to convert a PHP Docker image from Ubuntu to Alpine to reduce the image size, remove unnecessary dependencies, and decrease build time. Due to the version of PHP we need to support, we can only use Alpine 3.10 for the moment.
One of the tools the application uses is wkhtmltopdf, which converts HTML files to PDFs. This works great for common English characters but seems to struggle with other characters such as Chinese or Thai.
To reproduce, use the Dockerfile and test.html below:
------- Dockerfile -------
FROM alpine:3.10
RUN apk update && apk --no-cache add \
git libcurl wget \
curl tzdata procps vim \
python3 py3-pip \
zip unzip \
libsasl \
openssl \
libpng \
libjpeg \
libjpeg-turbo \
freetype \
libxml2 \
fontconfig \
icu libzip \
wkhtmltopdf \
libgcc libstdc++ libx11 glib libxrender libxext libintl \
font-noto-arabic terminus-font ttf-inconsolata ttf-dejavu font-noto font-noto-extra \
ttf-dejavu ttf-droid ttf-freefont ttf-liberation ttf-ubuntu-font-family \
libpng-dev libjpeg-turbo-dev freetype-dev libxml2-dev icu-dev autoconf gcc g++ make libzip-dev \
&& rm -rf /var/cache/apt/* && rm /var/cache/apk/*
COPY ./test.html ./
------- test.html -------
<html>
<body>
<p>English</p>
<p>電子郵件</p>
</body>
</html>
$ docker build -t character_test .
$ docker run --name character_test character_test wkhtmltopdf ./test.html ./test.pdf
$ docker cp character_test:./test.pdf ./test.pdf
$ docker rm character_test
$ docker rmi character_test
If you now open the PDF, the non-English text is rendered as characters that do not match those in the HTML file.
As you can see from the Dockerfile, I'm fairly sure we've installed just about every known font for Alpine in an attempt to resolve this, but we're not sure what the problem is or how to resolve it.
What is causing these characters to display incorrectly and how can we resolve it in our image?
I did not optimize the instructions in the Dockerfile; this is just a quick proof of concept (POC) to verify that the approach works.
docker_wkhtml2pdf
├── Dockerfile (1)
├── simsun.ttc (2)
└── data
    └── test3.html (3)
There are two main problems with the PDF output. The first is encoding, so I added the locale-related packages and settings. The second is the font, so I added SimSun (simsun.ttc).
FROM alpine:3.12
ENV LANG=en_US.UTF-8 \
LANGUAGE=en_US:en \
LC_ALL=en_US.UTF-8
RUN mkdir -p /usr/share/fonts/chinese/TrueType
COPY simsun.ttc /usr/share/fonts/chinese/TrueType/
RUN apk update
RUN apk add --no-cache \
bash \
libc6-compat \
musl-locales \
musl-locales-lang
RUN apk --no-cache add \
git libcurl wget \
curl tzdata procps vim \
python3 py3-pip \
zip unzip \
libsasl \
openssl \
libpng \
libjpeg \
libjpeg-turbo \
freetype \
libxml2 \
fontconfig \
icu libzip \
wkhtmltopdf \
libgcc libstdc++ libx11 glib libxrender libxext libintl \
font-noto-arabic terminus-font ttf-inconsolata ttf-dejavu font-noto font-noto-extra \
ttf-dejavu ttf-droid ttf-freefont ttf-liberation ttf-ubuntu-font-family \
libpng-dev libjpeg-turbo-dev freetype-dev libxml2-dev icu-dev autoconf gcc g++ make libzip-dev
RUN rm -rf /var/cache/apt/*
RUN rm /var/cache/apk/*
RUN echo "export LANG=en_US.UTF-8" >> /etc/profile
WORKDIR /documents
VOLUME /documents
# COPY ./test.html ./
# COPY ./test2.html ./
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p>English</p>
<p>電子郵件</p>
<p>สวัสดี</p>
</body>
</html>
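Note that this test file declares its encoding with a `<meta charset="UTF-8">` tag, which the test.html in the question did not. The following is a minimal sketch for checking whether an HTML file carries such a declaration before it is handed to wkhtmltopdf. The helper name `declares_charset` is hypothetical, and the pattern is a simple text match, not a real HTML parser.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file contains a <meta ... charset=...>
# declaration (either <meta charset="..."> or the http-equiv form).
declares_charset() {
    grep -qiE '<meta[^>]*charset[[:space:]]*=' "$1"
}

# Report on every HTML file passed on the command line.
for f in "$@"; do
    if declares_charset "$f"; then
        echo "$f: charset declared"
    else
        echo "$f: no charset declaration"
    fi
done
```

For files that cannot be edited, wkhtmltopdf also accepts an `--encoding` option that sets the default input encoding. Whether that alone would fix the output in the question is not something verified here.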
Copy simsun.ttc into the project directory, then build the image and run the conversion:
docker build -t character_test .
docker run --rm \
--user 1000:1000 \
-v `pwd`/data:/documents/ \
character_test \
wkhtmltopdf test3.html test3.pdf
Output
docker_wkhtml2pdf
├── Dockerfile
├── simsun.ttc
└── data
    ├── test3.html
    └── test3.pdf (4) Output file
The Chinese text is displayed correctly.
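To see which installed fonts fontconfig reports for a given language inside the container, a check along these lines may help. It is only a sketch: `fonts_for_lang` is a hypothetical helper name, and it assumes `fc-list` (part of the `fontconfig` package already listed in the Dockerfile) is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the font families that fontconfig reports
# as covering the given language tag (for example zh or th).
fonts_for_lang() {
    lang="$1"
    if ! command -v fc-list >/dev/null 2>&1; then
        echo "fc-list not found (is fontconfig installed?)" >&2
        return 127
    fi
    # ":lang=xx" is a fontconfig pattern; "family" limits the output
    # to the family name of each matching font.
    fc-list ":lang=$lang" family | sort -u
}

# Check the two scripts used in the test file.
for lang in zh th; do
    echo "== $lang =="
    fonts_for_lang "$lang" || true
done
```

An empty list for a language would suggest that no installed font claims coverage for it, which would be consistent with missing or wrong glyphs in the PDF.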
The PDF properties show that SimSun, a font used for Chinese, is embedded.
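The same check can be done from the command line instead of a PDF viewer. The sketch below assumes the `pdffonts` tool from poppler-utils is available on the machine where the PDF is inspected; it is not part of the Dockerfile above, and `list_pdf_fonts` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list the fonts referenced by a PDF using pdffonts.
# The "emb" column of the pdffonts output shows whether each font is embedded.
list_pdf_fonts() {
    pdf="$1"
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi
    if ! command -v pdffonts >/dev/null 2>&1; then
        echo "pdffonts not found (install poppler-utils)" >&2
        return 127
    fi
    pdffonts "$pdf"
}
```

For example, `list_pdf_fonts data/test3.pdf | grep -i simsun` should print a line if a SimSun font is referenced by the generated file.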