Tags: java, pdf, encoding, utf-8, pdfbox

How to sanitise a string before printing it to PDF with PDFBox


I am creating PDF documents from user inputs that are UTF-8.

The problem is not just how the PDFs display: the creation itself fails with java.lang.IllegalArgumentException: U+039B is not available in this font's encoding: WinAnsiEncoding.

Most answers here point to "using a font with better UTF-8 support", but as I have no control over user inputs, no font's coverage is ever going to be good enough. I need a bulletproof solution (as in: print something rather than error out).

The answer to "Using PDFBox to write unicode strings to a PDF" suggests that the text should be sanitised before it is added to the PDF.

The issue is that I cannot find a valid example of how to achieve this. All examples seem to point at code that has since been removed (font.setToUnicode, or some method in the encoding to convert characters one at a time).

So in a nutshell: I have a string, and I want a bulletproof method to write most of it to a PDFBox document (obviously, characters missing from the font will be replaced or not printed).

Many thanks, JM


Solution

  • I ended up doing a character-by-character sanitization.

    Here is what my sanitization function looks like.

    To avoid reprocessing characters, I am caching the availability of each character for each given font.

    When a code point is not available in a font, I try the "standard" replacement character (U+FFFD), and if that is not available either, I replace it with a question mark.

    It is admittedly inefficient, but I have not found a more efficient way to do this, bearing in mind that I have no control over, and no advance knowledge of, what is being printed.

    There might be a lot of things to improve but this works for my use case.

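    // Imports needed at the top of the enclosing class file:
    // import java.util.HashMap;
    // import java.util.Map;
    // import org.apache.pdfbox.pdmodel.font.PDFont;

    // Caches, per font name and code point, whether the font can encode it.
    private final Map<String, Boolean> codePointAvailCache = new HashMap<>();

    private String getPrintableString(String string, PDFont font) {

        StringBuilder sb = new StringBuilder();
        String fontName = font.getName();

        // Advance by whole code points so surrogate pairs are not split.
        for (int i = 0; i < string.length(); ) {

            int codePoint = string.codePointAt(i);
            i += Character.charCount(codePoint);

            // Pass line feeds through without testing them against the font.
            if (codePoint == 0x000A) {
                sb.appendCodePoint(codePoint);
                continue;
            }

            if (!isAvailable(font, fontName, codePoint)) {
                // Prefer U+FFFD; fall back to '?' if the font lacks it as well.
                codePoint = isAvailable(font, fontName, 0xFFFD) ? 0xFFFD : 0x003F;
            }

            sb.appendCodePoint(codePoint);
        }

        return sb.toString();
    }

    private boolean isAvailable(PDFont font, String fontName, int codePoint) {

        // A string key avoids the collisions a combined int hash could produce.
        String key = fontName + ':' + codePoint;
        Boolean available = codePointAvailCache.get(key);

        if (available == null) {
            try {
                font.encode(new String(Character.toChars(codePoint)));
                available = true;
            } catch (Exception e) {
                // encode() throws if the font cannot encode the code point
                available = false;
            }
            codePointAvailCache.put(key, available);
        }

        return available;
    }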
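
To try the same idea without a PDFBox dependency, here is a minimal sketch that swaps the font probe for a plain JDK charset check. It is not the code above: the class name SanitizeSketch and its methods are made up for illustration, and ISO-8859-1 is only a rough stand-in for WinAnsiEncoding (the two differ for some characters). In real code the availability test would still go through the font, as in the answer.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: ISO-8859-1 roughly approximates WinAnsiEncoding so
// that the sketch runs with the JDK alone. It does not use PDFBox at all.
class SanitizeSketch {

    private static final CharsetEncoder ENCODER =
            StandardCharsets.ISO_8859_1.newEncoder();

    // Remembers, per code point, whether the stand-in encoding can represent it.
    private static final Map<Integer, Boolean> CACHE = new HashMap<>();

    static boolean isAvailable(int codePoint) {
        return CACHE.computeIfAbsent(codePoint,
                cp -> ENCODER.canEncode(new String(Character.toChars(cp))));
    }

    static String sanitize(String text) {
        StringBuilder sb = new StringBuilder();
        // Step through whole code points so surrogate pairs stay together.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\n' || isAvailable(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // Prefer U+FFFD, fall back to '?' when it is unavailable too.
                sb.append(isAvailable(0xFFFD) ? '\uFFFD' : '?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Contains an accented letter, a Greek capital lambda and an emoji.
        System.out.println(sanitize("caf\u00E9 \u039B \uD83D\uDE00 ok"));
    }
}
```

With this stand-in encoding, the accented letter survives, while the Greek letter and the emoji each collapse to a single question mark, because U+FFFD is not encodable in ISO-8859-1 either. A font with a wider glyph set would keep more of the input.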