Tags: ruby-on-rails, ruby, unicode, wkhtmltopdf, wicked-pdf

wicked_pdf shows unknown character on unicode pdf conversion (ruby)


I'm trying to create a PDF from an HTML page using the wicked_pdf (version 1.1) and wkhtmltopdf-binary gems. My HTML page contains a calendar emoji that displays well in the browser, whatever font I use:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta http-equiv='content-type' content='text/html; charset=utf-8' />
  <style>
  unicode {
     font-family: 'OpenSansEmoji', sans-serif;
  }
  @font-face {
     font-family: 'OpenSansEmoji';
     src: url(data:font/truetype;charset=utf-8;base64,<-- encoded_font_base64_string-->) format('truetype');
  }
 </style>
 </head>
 <body>
 <div><unicode>&#128197;</unicode></div>
 </body>
 </html>

However, when I try to generate the PDF using the gem's WickedPdf.new.pdf_from_html_file method in the Rails console:

 # Render the HTML file to PDF data, then write it out in binary mode
 pdf = WickedPdf.new.pdf_from_html_file('<--absolute_path_of_html_file-->')
 File.open(File.expand_path('~/<--pdf_filename-->.pdf'), 'wb+') { |f| f.write(pdf) }

I get the following result:

[Image: PDF result with unknown character]

As you can see, the first calendar icon is displayed properly, but it is followed by a second character, and I don't know where that one is coming from.

I have investigated UTF-8 and UTF-16 encoding and surrogate pairs, as suggested by this related post (stackoverflow_emoji_wkhtmltopdf), and looked at this issue (wkhtmltopdf_git_issue), but I still can't make this character disappear!
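
For reference, here is a minimal Ruby sketch of what those encodings look like for the calendar emoji. The variable names are only illustrative, and the snippet just shows the representations; it does not by itself explain the extra glyph.

```ruby
# &#128197; in the HTML above is the decimal code point 128197, i.e. U+1F4C5.
code_point = 128_197
calendar = [code_point].pack('U') # one-character UTF-8 string

# UTF-8 encodes this code point as four bytes.
utf8_bytes = calendar.bytes.map { |b| format('%02X', b) }

# UTF-16 needs a surrogate pair because the code point is above U+FFFF.
utf16_units = calendar.encode('UTF-16BE').unpack('n*').map { |u| format('%04X', u) }

puts "code point:   U+#{code_point.to_s(16).upcase}"
puts "UTF-8 bytes:  #{utf8_bytes.join(' ')}"
puts "UTF-16 units: #{utf16_units.join(' ')}"
```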

If you have any clue, it's more than welcome.

Thanks in advance for your help!

EDIT

Following the comments from Eric Duminil and petkov.np, I can confirm that the code above works properly for me on Linux. This seems to be a Linux vs. macOS issue. Can anyone suggest what the core of the issue is in the macOS build, and whether it can be fixed?


Solution

  • I've edited this answer several times; please see the notes at the end as well as the comments.

    I'm using macOS 10.12.2 and have the same issue. I'm listing all the browser etc. versions, although I suspect the biggest factor is the OS/wkhtmltopdf build.

    • Chrome: Version 55.0.2883.95 (64-bit)
    • Safari: Version 10.0.2 (12602.3.12.0.1)
    • wkhtmltopdf: 0.12.3 (with patched qt)

    I'm using the following example snippet:

    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html" charset="utf-8">
        <style type="text/css">
          p {
            font-family: 'EmojiSymbols', sans-serif;
          }
          @font-face {
            font-family: 'EmojiSymbols';
            src: local('EmojiSymbols-Regular.woff'), url('EmojiSymbols-Regular.woff') format('woff');
          }
    
          span:before {
            content: '\01F60B';
          }
        </style>
      </head>
      <body>
        <p>
          😋
          <span></span>
          &#x1F60B;
          &#128523;
          &#xf0;&#x9f;&#x98;&#x8b;
        </p>
      </body>
    </html>
    

    I'm calling wkhtmltopdf with the --encoding 'UTF-8' option.
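
A rough sketch of that call from Ruby, for anyone who wants to reproduce it outside wicked_pdf. The helper name and the file names are made up for illustration; only the --encoding flag comes from the call described above.

```ruby
# Hypothetical helper: builds the argument list for a wkhtmltopdf call
# that passes the --encoding option described above.
def wkhtmltopdf_command(input_path, output_path, encoding: 'UTF-8')
  ['wkhtmltopdf', '--encoding', encoding, input_path, output_path]
end

# 'emoji.html' and 'emoji.pdf' are placeholder file names.
command = wkhtmltopdf_command('emoji.html', 'emoji.pdf')
puts command.join(' ')

# To actually run it (requires wkhtmltopdf on the PATH):
# system(*command)
```

Passing the arguments as an array to system avoids shell quoting problems with paths that contain spaces.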

    You can see the rendered result here (I'm sorry for the lame screenshot). Some brief conclusions:

    1. Safari doesn't render the 'raw' UTF-8 bytes properly; it seems to treat them as a plain byte sequence (last line in the HTML paragraph). Everything else renders fine in Safari.
    2. Chrome renders everything fine.
    3. With the above option, wkhtmltopdf renders the raw bytes (sort of) OK, but doesn't render the CSS content property properly. Every 'proper' occurrence of the Unicode symbol is followed by this strange phantom symbol.

    I've tried literally everything, but the results are the same. For me, the fact that even Safari doesn't render the raw bytes properly indicates some system-level problem that is macOS specific. It's unclear to me whether this should be reported as a wkhtmltopdf issue or whether there is some misbehaving dependency in the macOS build.
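
A small Ruby sketch of why the 'raw bytes' line behaves differently from the others, assuming the usual rule that each numeric character reference names a code point rather than a byte. The four values decode to the emoji only when they are read as the bytes of one UTF-8 sequence; read as four references, they are four separate characters.

```ruby
emoji = [0x1F60B].pack('U')          # the emoji used in the snippet above
raw_values = [0xF0, 0x9F, 0x98, 0x8B]

# Read as UTF-8 bytes, the four values form the single emoji character.
from_bytes = raw_values.pack('C*').force_encoding('UTF-8')

# Read as four code points (what &#xf0;&#x9f;&#x98;&#x8b; asks for),
# they form a string of four unrelated characters.
from_refs = raw_values.pack('U*')

puts "bytes decode to the emoji: #{from_bytes == emoji}"
puts "characters when read as bytes:       #{from_bytes.length}"
puts "characters when read as code points: #{from_refs.length}"
```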

    EDIT: Safari seems to work fine, my markup was broken.

    EDIT: A CSS workaround may do the trick, please check the comments below.

    FINAL EDIT: As shown in the comments, the CSS 'hack' that solves the issue is text-rendering: optimizeLegibility;. This seems to be needed only on macOS/OS X.

    From my comment below:

    I just found this issue. It seems irrelevant at first glance, but adding text-rendering: optimizeLegibility; to my styles removed the duplicate characters (on macOS). Why this happens is beyond me. As the issue author also uses OS X, it's apparent there is some problem with wkhtmltopdf builds for this OS.
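
A minimal sketch of applying the workaround from Ruby before handing the markup to wicked_pdf. The helper add_text_rendering_fix is hypothetical (it is not part of wicked_pdf), and putting the rule on body is my assumption; the rule could equally go on whichever elements contain the emoji.

```ruby
# Hypothetical helper (not part of wicked_pdf): injects the workaround rule
# into an HTML string, just before </head> when there is one.
def add_text_rendering_fix(html)
  # Applying the rule to <body> is an assumption made for this sketch.
  fix = '<style>body { text-rendering: optimizeLegibility; }</style>'
  if html.match?(%r{</head>}i)
    html.sub(%r{</head>}i) { |closing_tag| fix + closing_tag }
  else
    fix + html
  end
end

html = '<html><head><meta charset="utf-8"></head><body><p>&#x1F60B;</p></body></html>'
patched = add_text_rendering_fix(html)
puts patched

# With the wicked_pdf gem loaded, the patched markup could then be converted:
# pdf = WickedPdf.new.pdf_from_string(patched)
```

If the styles live in your own template, adding the rule there directly is simpler; the helper is only useful when the HTML comes from somewhere you don't control.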