java unicode pdfbox emoji surrogate-pairs

PDFBox - Cannot encode Strings comprised of surrogate pairs


In my application, which uses PDFBox, I write strings in multiple languages by testing different fonts until I find one that can encode each character.

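import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

// Excerpt: pdfDocument and sValue are defined elsewhere.
PDFont currentFont = PDType0Font.load(pdfDocument, new File("path/to/font/font.ttf"));
// Walk the string by code point rather than by char, so surrogate pairs stay together.
for (int offset = 0; offset < sValue.length();) {
    int iCodePoint = sValue.codePointAt(offset);
    boolean isEncodable = isCodePointEncodable(currentFont, iCodePoint);
    //-Further logic here, etc.

    // Advance by 1 char for BMP code points, 2 for supplementary ones.
    offset += Character.charCount(iCodePoint);
}

// Returns true if the font can encode the given code point.
private boolean isCodePointEncodable (PDFont currentFont, int iCodePoint) throws IOException {
    StringBuilder st = new StringBuilder();
    st.appendCodePoint(iCodePoint);
    try {
        currentFont.encode(st.toString());
        return true;
    } catch (IllegalArgumentException iae) {
        // Treated as "this font cannot encode the code point".
        return false;
    }
}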

While this works fine for anything in the Basic Multilingual Plane (BMP), it fails for every code point beyond the BMP. I have downloaded the fonts involved, inspected them extensively with glyph charts, and logged each code point. For instance, when encoding 🚁, which is U+1F681 (decimal 128641), the log shows that the code tried NotoEmoji-Regular.ttf, which is the correct font and does contain this character. Even so, isCodePointEncodable returned false.

Specifically, my logging server returned this:

Code Point 128641 (🚁) cannot be encoded in font NotoEmoji
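For context, U+1F681 lies outside the BMP, so a Java String stores it as a surrogate pair of two chars. The following minimal sketch, which uses only the standard library and a hypothetical class name (SurrogatePairDemo), shows the UTF-16 code units behind the code point from the log line above. It does not touch PDFBox; it only confirms that the string handed to encode() is a well-formed pair.

```java
final class SurrogatePairDemo {

    /** Returns the UTF-16 code units Java uses to store one code point. */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** True when the code point lies outside the BMP and needs a surrogate pair. */
    static boolean needsSurrogatePair(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    public static void main(String[] args) {
        int helicopter = 0x1F681; // decimal 128641
        char[] units = toUtf16(helicopter);
        String s = new String(units);

        // One code point, but two chars in the String.
        System.out.printf("chars=%d codePoints=%d%n",
                s.length(), s.codePointCount(0, s.length()));

        // Prints the high surrogate, then the low surrogate.
        for (char c : units) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Running main prints chars=2 codePoints=1, followed by U+D83D and U+DE81.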

Are there any workarounds or solutions for this? Thank you.


Solution

  • I have created and resolved issue PDFBOX-3997. The cause was that PDFBox did not use the best possible cmap subtable of the font.

    There is no workaround, but the bug will be fixed in version 2.0.9, which is coming in a few months. You don't have to wait that long, though: you can test with a snapshot build.
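To illustrate what "best possible cmap subtable" can mean, here is a rough sketch. It is not the actual PDFBox patch: the class CmapChoiceSketch, the Subtable type, and the ranking are hypothetical and written only for this example. The platform and encoding IDs are the ones defined for the TrueType/OpenType cmap table, where (3, 10) and (0, 4) cover the full Unicode repertoire, while (3, 1) and (0, 3) cover only the BMP.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class CmapChoiceSketch {

    /** Hypothetical stand-in for a cmap subtable header; not a PDFBox or FontBox type. */
    static final class Subtable {
        final int platformId;
        final int encodingId;

        Subtable(int platformId, int encodingId) {
            this.platformId = platformId;
            this.encodingId = encodingId;
        }
    }

    /** Lower rank is better. Full-repertoire Unicode subtables come first. */
    static int rank(Subtable t) {
        if (t.platformId == 3 && t.encodingId == 10) return 0; // Windows, full Unicode
        if (t.platformId == 0 && t.encodingId == 4) return 1;  // Unicode platform, full Unicode
        if (t.platformId == 3 && t.encodingId == 1) return 2;  // Windows, BMP only
        if (t.platformId == 0 && t.encodingId == 3) return 3;  // Unicode platform, BMP only
        return 4;                                              // anything else
    }

    /** Picks the best-ranked subtable, or empty if the font lists none. */
    static Optional<Subtable> pickBest(List<Subtable> tables) {
        return tables.stream().min(Comparator.comparingInt(CmapChoiceSketch::rank));
    }
}
```

Under this assumed ranking, a font that offers both (3, 1) and (3, 10) is read through the (3, 10) subtable, which can map U+1F681. A lookup that settles for the BMP-only subtable cannot see code points above U+FFFF, even though the font contains the glyph.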