Docker Alpine wkhtmltopdf Chinese / Thai Characters Display Incorrectly


We're converting a PHP Docker image from Ubuntu to Alpine to reduce the image size, remove unnecessary dependencies, and decrease build time. Because of the PHP version we need to support, we can only use Alpine 3.10 for the moment.

One of the tools the application uses is wkhtmltopdf, which converts HTML files to PDFs. This works well for common English characters but struggles with other scripts such as Chinese or Thai.

To reproduce, use the Dockerfile and test.html below:

------- Dockerfile -------

FROM alpine:3.10

RUN apk update && apk --no-cache add \
        git libcurl wget \
        curl tzdata procps vim \
        python3 py3-pip \
        zip unzip \
        libsasl \
        openssl \
        libpng \
        libjpeg \
        libjpeg-turbo \
        freetype \
        libxml2 \
        fontconfig \
        icu libzip \
        wkhtmltopdf \
        libgcc libstdc++ libx11 glib libxrender libxext libintl \
        font-noto-arabic terminus-font ttf-inconsolata ttf-dejavu font-noto font-noto-extra \
        ttf-dejavu ttf-droid ttf-freefont ttf-liberation ttf-ubuntu-font-family \
        libpng-dev libjpeg-turbo-dev freetype-dev libxml2-dev icu-dev autoconf gcc g++ make libzip-dev \
    && rm -rf /var/cache/apt/* && rm /var/cache/apk/*

COPY ./test.html ./

------- test.html -------
<html>

<body>
    <p>English</p>
    <p>電子郵件</p>
</body>

</html>

$ docker build -t character_test . 
$ docker run --name character_test character_test wkhtmltopdf ./test.html ./test.pdf
$ docker cp character_test:./test.pdf ./test.pdf
$ docker rm character_test
$ docker rmi character_test

If you now open the PDF, you see something like the output below, which does not match the characters in the HTML file.

[Screenshot: PDF output]

As you can see from the Dockerfile, we've installed just about every font package we could find for Alpine in an attempt to resolve this, but we're not sure what the problem is or how to resolve it.

What is causing these characters to display incorrectly and how can we resolve it in our image?


Solution

  • I did not optimize the instructions in the Dockerfile; this is only a quick proof of concept (POC) to verify that the approach works.

    Project Directory

    docker_wkhtml2pdf
    ├── Dockerfile   (1)  
    ├── simsun.ttc   (2) 
    └── data
        └── test3.html (3)
    

    Dockerfile (1)

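    There are two main problems behind the incorrect PDF output. The first is encoding, so I added the locale packages and the related environment settings. The second is the font, so I added simsun.ttc (SimSun, a font used for Chinese). Note that this POC is based on alpine:3.12, not the alpine:3.10 used in the question.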

    FROM alpine:3.12
    
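    # Locale settings (the encoding half of the fix)
    ENV LANG=en_US.UTF-8 \
        LANGUAGE=en_US:en \
        LC_ALL=en_US.UTF-8
    
    # Chinese font (the font half of the fix), placed under /usr/share/fonts
    RUN mkdir -p /usr/share/fonts/chinese/TrueType
    COPY simsun.ttc /usr/share/fonts/chinese/TrueType/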
    
    RUN apk update
    
    RUN apk add --no-cache \
        bash \
        libc6-compat \
        musl-locales \
        musl-locales-lang
    
    RUN apk --no-cache add \
            git libcurl wget \
            curl tzdata procps vim \
            python3 py3-pip \
            zip unzip \
            libsasl \
            openssl \
            libpng \
            libjpeg \
            libjpeg-turbo \
            freetype \
            libxml2 \
            fontconfig \
            icu libzip \
            wkhtmltopdf \
            libgcc libstdc++ libx11 glib libxrender libxext libintl \
            font-noto-arabic terminus-font ttf-inconsolata ttf-dejavu font-noto font-noto-extra \
            ttf-dejavu ttf-droid ttf-freefont ttf-liberation ttf-ubuntu-font-family \
            libpng-dev libjpeg-turbo-dev freetype-dev libxml2-dev icu-dev autoconf gcc g++ make libzip-dev
    
    
    # Remove the package index downloaded by `apk update`.
    # /var/cache/apt is an Ubuntu path and does not exist on Alpine.
    RUN rm -rf /var/cache/apk/*
    
    RUN echo "export LANG=en_US.UTF-8" >> /etc/profile
    
    WORKDIR /documents
    VOLUME /documents
    
    # COPY ./test.html ./
    # COPY ./test2.html ./
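
To see whether the font half of the fix took effect, one option is to ask fontconfig inside the container which languages it has a font for. The sketch below is not part of the original POC: the function name `check_lang` is made up for illustration, and it assumes `fc-list` (from the `fontconfig` package already installed above) is on the PATH. A "covered" result only means fontconfig knows a matching font; it does not prove that wkhtmltopdf will pick it.

```shell
#!/bin/sh
# Report whether fontconfig knows at least one font for a language tag.
# check_lang is a hypothetical helper name, not part of any tool.
check_lang() {
    lang="$1"
    # fc-list prints one line per matching font; empty output means no match
    # (or that fc-list itself is not installed).
    if [ -n "$(fc-list ":lang=$lang" family 2>/dev/null)" ]; then
        echo "$lang: covered"
    else
        echo "$lang: MISSING"
    fi
}

# English, Chinese and Thai, matching the three paragraphs in test3.html.
for l in en zh th; do
    check_lang "$l"
done
```

Run it from a shell inside the image, for example one started with `docker run --rm -it character_test sh`. Printing `locale` or `echo "$LANG"` in the same shell shows whether the locale settings are visible.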
    
    

    test3.html (3)

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <body>
        <p>English</p>
        <p>電子郵件</p>
        <p>สวัสดี</p>
    </body>
    </html>
    

    simsun.ttc (2)

    Copy simsun.ttc into the project directory, next to the Dockerfile.

    Build image

    docker build -t character_test . 
    

    Run

    docker run --rm  \
           --user 1000:1000 \
           -v `pwd`/data:/documents/ \
           character_test \
           wkhtmltopdf test3.html test3.pdf
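
The command above hardcodes `1000:1000` as the user and uses backticks around `pwd`, which breaks if the path contains spaces. A slightly more defensive variant might look like the sketch below. The function name `render_pdf` and the `DRY_RUN` switch are made up for illustration; the image name and the `/documents` mount are taken from this answer.

```shell
#!/bin/sh
# Render one HTML file from ./data to a PDF with the same base name.
# Set DRY_RUN=1 to print the docker command instead of running it.
render_pdf() {
    html="$1"
    pdf="${html%.html}.pdf"
    # Run as the calling user so the PDF is not owned by someone else,
    # and quote the mount path so directories with spaces work.
    set -- docker run --rm \
        --user "$(id -u):$(id -g)" \
        -v "$(pwd)/data:/documents/" \
        character_test \
        wkhtmltopdf "$html" "$pdf"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Print the command that would be run for test3.html.
DRY_RUN=1 render_pdf test3.html
```

Dropping `DRY_RUN=1` runs the conversion for real, which needs Docker and the `character_test` image built above.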
    

    Output

    docker_wkhtml2pdf
    ├── Dockerfile
    ├── simsun.ttc
    └── data
        ├── test3.html
        └── test3.pdf    (4) Output file
    

    Check test3.pdf (4)

    The Chinese text string is displayed correctly.

    The PDF properties show that a SimSun font, which is a font used for Chinese, is embedded.
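
The embedded fonts can also be checked from the command line instead of a PDF viewer. The sketch below assumes `pdffonts` from poppler-utils is installed on the host, which is not part of the original setup; the function name `list_pdf_fonts` is made up for illustration.

```shell
#!/bin/sh
# List the fonts used by a PDF with pdffonts (poppler-utils).
# list_pdf_fonts is a hypothetical helper name.
list_pdf_fonts() {
    pdf="$1"
    if [ ! -f "$pdf" ]; then
        echo "missing file: $pdf" >&2
        return 1
    fi
    if ! command -v pdffonts >/dev/null 2>&1; then
        echo "pdffonts not found (install poppler-utils)" >&2
        return 2
    fi
    # The "emb" column reads "yes" for fonts embedded in the file.
    pdffonts "$pdf"
}

list_pdf_fonts data/test3.pdf || echo "could not inspect data/test3.pdf"
```

If the fix worked as described above, a SimSun entry should appear in the list.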