Search code examples
javaencodingpdfbox

U+FFFD is not available in this font's encoding: WinAnsiEncoding


I'm using PDFBox 2.0.1.

I try to dynamically add some (user provided) UTF8 text to the form fields and show the result to the user. Unfortunately either the pdf library is not capable of properly encoding special characters such as "äöü"... or I was not able find any useful documentation that could help me with this issue.

Can someone tell me what is wrong with the given code sample?

try (PDDocument document = PDDocument.load(pdfTemplate)) {
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    PDAcroForm form = catalog.getAcroForm();

    List<PDField> fields = form.getFields();
    for (PDField field : fields) {
        switch (field.getPartialName()) {
            case "devices":
                // Frontend (JS): userInput = btoa('Gerät')
                String userInput = ...
                String name = new String(Base64.getDecoder().decode(base64devices), "UTF-8");
                field.setReadOnly(true);
                break;
        }
    }
    form.flatten(fields, true);
    document.save(bos);
}

And here the stacktrace of the error:

java.lang.IllegalArgumentException: U+FFFD is not available in this font's encoding: WinAnsiEncoding
    org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.encode(PDTrueTypeFont.java:368)
    org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:286)
    org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:315)
    org.apache.pdfbox.pdmodel.interactive.form.PlainText$Paragraph.getLines(PlainText.java:169)
    org.apache.pdfbox.pdmodel.interactive.form.PlainTextFormatter.format(PlainTextFormatter.java:182)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.insertGeneratedAppearance(AppearanceGeneratorHelper.java:373)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceContent(AppearanceGeneratorHelper.java:237)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceValue(AppearanceGeneratorHelper.java:144)
    org.apache.pdfbox.pdmodel.interactive.form.PDTextField.constructAppearances(PDTextField.java:263)
    org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.refreshAppearances(PDAcroForm.java:324)
    org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.flatten(PDAcroForm.java:213)
    my.application.service.PDFService.generatePDF(PDFService.java:201)

I also found those (related) issues on SO:

pdfbox: ... is not available in this font's encoding But that does not help me choose the right encoding or how. IIRC Java uses UTF16 internally for character encoding why is the default not enough though? Is that an issue of the PDF-document itself or the code I use to set it?


PdfBox encode symbol currency euro Well its dynamic user input, so there are way to many things I would have to replace myself.

Thus, if the PDFBox people decided to fix the broken PDFBox method, this seemingly clean work-around code here would start to fail as it would then feed the fixed method broken input data.

Admittedly, I doubt they will fix this bug before 2.0.0 (and in 2.0.0 the fixed method has a different name), but one never knows...

Unfortunately I was not able to find this other setter method, but it might also be a different scope it does apply to.

EDIT

Updated example code to better represent the problem.


Solution

  • U+FFFD is used to replace an incoming character whose value is unknown or unrepresentable in Unicode compare the use of U+001A as a control character to indicate the substitute function (source).

    That said it is likely that that character gets messed up somewhere. Maybe the encoding of the file is not UTF-8 and that's why the character is messed up.

    As a general rule you should only write ASCII characters in the source code. You can still represent the whole Unicode range using the escaped form \uXXXX. In this case ä -> \u00E4.

    -- UPDATE --

    Apparently the problem is in how the user input get encoded/decoded from client/server side using the JS function btoa. A solution to this problem can be found at this link:

    Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings