In my application, which uses PDFBox, I have written methods that write strings in multiple languages by trying out different fonts.
PDFont currentFont = PDType0Font.load(pdfDocument, new File("path/to/font/font.ttf"));
for (int offset = 0; offset < sValue.length();) {
    int iCodePoint = sValue.codePointAt(offset);
    boolean isEncodable = isCodePointEncodable(currentFont, iCodePoint);
    //-Further logic here, etc.
    offset += Character.charCount(iCodePoint);
}

private boolean isCodePointEncodable(PDFont currentFont, int iCodePoint) throws IOException {
    StringBuilder st = new StringBuilder();
    st.appendCodePoint(iCodePoint);
    try {
        currentFont.encode(st.toString());
        return true;
    } catch (IllegalArgumentException iae) {
        return false;
    }
}
While this works fine for anything in the Basic Multilingual Plane (BMP), it fails for any code point beyond the BMP. I have examined the fonts involved extensively with glyph charts and have logged each code point. For instance, when attempting to encode 🚁, which is U+1F681 (decimal 128641), the log shows that the code tried to encode this character in NotoEmoji-Regular.ttf, which is the correct font and does contain this glyph. Unfortunately, the check still returned false.
Specifically, my logging server returned this:
Code Point 128641 (🚁) cannot be encoded in font NotoEmoji
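To rule out a mistake in the code point handling itself, here is a minimal sketch using only the standard library (no PDFBox). The class name `SupplementaryCodePointDemo` and the helper `describe` are made up for illustration. It shows that U+1F681 lies outside the BMP and occupies two Java `char`s, so the loop above has to advance by `Character.charCount`, which it does.

```java
import java.util.Locale;

class SupplementaryCodePointDemo {
    // Hypothetical helper: summarizes how Java represents a code point.
    static String describe(int codePoint) {
        return "U+" + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT)
                + " chars=" + Character.charCount(codePoint)
                + " supplementary=" + Character.isSupplementaryCodePoint(codePoint);
    }

    public static void main(String[] args) {
        // Build the string the same way isCodePointEncodable does.
        String s = new StringBuilder().appendCodePoint(128641).toString();
        System.out.println(describe(s.codePointAt(0)));
        // The string is a surrogate pair, so its length is 2, not 1.
        System.out.println(s.length());
    }
}
```

So the string passed to `encode` appears to be well formed, which suggests the problem is on the font lookup side.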
Are there any workarounds or solutions for this? Thank you.
I have created and resolved issue PDFBOX-3997. The cause was that we didn't use the best possible cmap subtable.
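To illustrate what "best possible cmap subtable" can mean, here is a rough sketch. It is not the actual PDFBox code or the actual fix; `CmapSelectionSketch`, `Subtable` and `pickBest` are hypothetical names. It relies only on the OpenType convention that the (platform 3, encoding 10) and (platform 0, encoding 4) subtables cover the full Unicode range, while (3, 1) and (0, 3) cover the BMP only. A font can contain both kinds, and a lookup that goes through a BMP-only subtable cannot find U+1F681 even though the glyph is in the font.

```java
import java.util.List;

class CmapSelectionSketch {
    // Hypothetical stand-in for a cmap subtable header (not a PDFBox type).
    static final class Subtable {
        final int platformId;
        final int encodingId;

        Subtable(int platformId, int encodingId) {
            this.platformId = platformId;
            this.encodingId = encodingId;
        }

        // Subtables that can map code points beyond U+FFFF.
        boolean isFullUnicode() {
            return (platformId == 3 && encodingId == 10)
                    || (platformId == 0 && encodingId == 4);
        }

        // Subtables limited to the Basic Multilingual Plane.
        boolean isBmpUnicode() {
            return (platformId == 3 && encodingId == 1)
                    || (platformId == 0 && encodingId == 3);
        }
    }

    // Prefer a full-Unicode subtable, fall back to a BMP-only one,
    // and return null if neither kind is present.
    static Subtable pickBest(List<Subtable> subtables) {
        for (Subtable s : subtables) {
            if (s.isFullUnicode()) {
                return s;
            }
        }
        for (Subtable s : subtables) {
            if (s.isBmpUnicode()) {
                return s;
            }
        }
        return null;
    }
}
```

With a font that offers both (3, 1) and (3, 10), `pickBest` returns the (3, 10) subtable regardless of the order in which they are listed.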
There is no workaround, but the bug will be fixed in version 2.0.9, which is coming in a few months. You don't have to wait that long, though: you can test with a snapshot build.