I am creating PDF documents from user inputs that are UTF-8.
The problem is not just how the PDFs display: the creation itself fails with java.lang.IllegalArgumentException: U+039B is not available in this font's encoding: WinAnsiEncoding.
Most answers here point to "using a font with better UTF-8 support", but as I have no control over user inputs, no font's coverage is ever going to be good enough. I need a bulletproof solution (as in: print something rather than error out).
The answer to "Using PDFBox to write unicode strings to a PDF" suggests that the text should be sanitised before it is added to the PDF.
The issue is that I cannot find a valid example of how to achieve this.
All examples seem to point at code that has since been removed (font.setToUnicode, or some method in the encoding to convert characters one at a time).
So in a nutshell: I have a string, and I want a bulletproof method to write most of it to a PDFBox document (obviously, characters missing from the font will be replaced or not printed).
Many thanks, JM
I ended up doing a character-by-character sanitization.
Here is what my sanitization function looks like.
To avoid reprocessing characters, I am caching the availability of each character for each given font.
When a code point is not available in a font, I try the "standard" replacement character (U+FFFD), and if that is not available either, I replace it with a question mark.
It is indeed inefficient, but I have not found a more efficient way to do this, bearing in mind that I have no control over, and no advance knowledge of, what is being printed.
There might be a lot of things to improve but this works for my use case.
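// Requires: import org.apache.pdfbox.pdmodel.font.PDFont;
// Assumes a field declared elsewhere in the class, for example:
// private final Map<Integer, Boolean> codePointAvailCache = new HashMap<>();
private String getPrintableString(String string, PDFont font) {
    StringBuilder sb = new StringBuilder();
    String fontName = font.getName();
    // Advance by whole code points so surrogate pairs (e.g. emoji) are not split.
    for (int i = 0; i < string.length(); ) {
        int codePoint = string.codePointAt(i);
        i += Character.charCount(codePoint);
        // Line feeds are passed through untouched.
        if (codePoint == 0x000A) {
            sb.appendCodePoint(codePoint);
            continue;
        }
        // Cache key combining font name and code point (hash collisions are possible).
        int cpKey = 31 * fontName.hashCode() + codePoint;
        Boolean available = codePointAvailCache.get(cpKey);
        if (available == null) {
            try {
                font.encode(new String(Character.toChars(codePoint)));
                available = true;
            } catch (Exception e) {
                available = false;
            }
            codePointAvailCache.put(cpKey, available);
        }
        if (!available) {
            // Need to make sure our font has a replacement character
            try {
                codePoint = 0xFFFD;
                font.encode(new String(Character.toChars(codePoint)));
            } catch (Exception e) {
                codePoint = 0x003F;
            }
        }
        sb.appendCodePoint(codePoint);
    }
    return sb.toString();
}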
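For anyone who wants to try the idea without a PDFBox dependency, here is a minimal standalone sketch of the same approach. The class name PrintableText, the method sanitize and the IntPredicate parameter are all hypothetical names introduced for illustration, and the demo uses the JDK's ISO-8859-1 encoder purely as a stand-in for a font's encoding. With PDFBox, the predicate would presumably wrap font.encode(...) in a try/catch, as in the function above.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

class PrintableText {

    // Replaces each code point the predicate rejects; line feeds pass through untouched.
    static String sanitize(String text, IntPredicate canEncode) {
        // Prefer U+FFFD as the stand-in glyph, falling back to '?' if that is missing too.
        int fallback = canEncode.test(0xFFFD) ? 0xFFFD : '?';
        // Per-call cache so each distinct code point is probed only once.
        Map<Integer, Boolean> cache = new HashMap<>();
        StringBuilder sb = new StringBuilder(text.length());
        // codePoints() yields whole code points, so surrogate pairs are never split.
        text.codePoints().forEach(cp -> {
            boolean ok = cp == '\n' || cache.computeIfAbsent(cp, k -> canEncode.test(k));
            sb.appendCodePoint(ok ? cp : fallback);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-1 stands in for the font's encoding in this demo.
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        IntPredicate canEncode = cp -> latin1.canEncode(new String(Character.toChars(cp)));
        System.out.println(sanitize("caf\u00e9 \u039B \uD83D\uDE00", canEncode));
    }
}
```

Passing the availability check in as a predicate keeps the replacement logic independent of any particular font, and the cache here lives only for one call. In the demo, the Greek capital lambda and the emoji each come out as a single question mark, while the accented e survives.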