Tags: pdf, utf-8, autofill, pdftk

Filling PDF fields with Chinese characters produces garbled output


I am trying to fill a PDF field with Chinese characters from an FDF or XFDF file.

So far I have tried pdftk, mcpdf, pdfbox and fpdm.

They can all get the characters into the field, but the characters don't display. When I click on the field to edit it, the characters show as expected, but when I click out of the field again they disappear. If I input English text it is displayed incorrectly, e.g. "hello" becomes "IFMMP".

This has all led me to suspect it's an issue with fonts or character maps. I have tried embedding the full font into the PDF and it made no difference. I have also installed the fonts on the machine, to no avail.

If I edit the PDF and fill the field in Acrobat, it accepts the Chinese characters without a problem and I can view the PDF in a reader. I have tried using pdftk from the command line on the same Windows machine and I get the same problem.

I need this to work in a Linux environment, preferably in Python or through a command-line script, but really at this point I'd just like to see it work at all! I have attached the sample PDF, FDF, XFDF and the output it is creating. Any help would be greatly appreciated, as I've run out of ideas. I have been using the command:

"pdftk test_form.pdf fill_form test.xfdf output output.pdf verbose"

https://drive.google.com/folderview?id=0B6ExNaWGFzvnfnJHSC1ZdXhSU2RQVENjYW56UkZyYWJMdWhZTkpQYkZBcUs0Tjhjb0NITVE&usp=sharing


Solution

  • When a form field is filled, the field's value is populated and, optionally, a visual appearance for the form field is generated that reflects the newly set value. The reason you see the value when you click into the form field is that an activated field displays its value, whereas a field that is not activated displays its stored appearance.

    If you tried setting the value with PDFBox 1.8, you might try PDFBox 2.0 instead, as it now has Unicode support and its appearance generation has been redone.

    You also need to ensure that the font used in the form is available on the system where you fill the form. Otherwise, with PDFBox 2.0 you might get an error message similar to:

    Warning: Using fallback font 'TimesNewRomanPSMT' for 'MingLiU'
    Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+5185 in font MingLiU
    

    This happens because MingLiU is not available on the system, so it has been replaced by TimesNewRomanPSMT, which doesn't have the glyph needed.

    As another solution, you can direct Adobe Reader to generate the appearance for you when the form is opened, using:

    PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
    form.setNeedAppearances(true);
    

    This again uses PDFBox 2.0.

    I've created a little sample using PDFBox 2.0 that creates a form from scratch, to test whether it can handle the Chinese text:

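    import java.io.File;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.pdmodel.font.PDFont;
    import org.apache.pdfbox.pdmodel.font.PDType0Font;
    import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
    import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
    import org.apache.pdfbox.pdmodel.interactive.form.PDTextField;
    
    // create a new PDF document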
    PDDocument doc = new PDDocument();
    PDPage page = new PDPage();
    
    // add a new AcroForm and add that to the document
    PDAcroForm form = new PDAcroForm(doc);
    doc.getDocumentCatalog().setAcroForm(form);
    
    // Add and set the resources and default appearance at the form level
    PDFont font = PDType0Font.load(doc, new File("/Library/Fonts/Arial Unicode.ttf"));
    PDResources res = new PDResources();
    COSName fontName = res.add(font);
    form.setDefaultResources(res);
    String da = "/" + fontName.getName() + " 12 Tf 0 g";
    form.setDefaultAppearance(da);
    
    // add a page to the document 
    doc.addPage(page);
    
    // add a form field to the form
    PDTextField textBox = new PDTextField(form);
    textBox.setPartialName("Chinese");
    form.getFields().add(textBox);
    
    // specify the annotation associated with the field
    // and add it to the page
    PDAnnotationWidget widget = textBox.getWidget();
    PDRectangle rect = new PDRectangle(100f,300f,120f,350f);
    widget.setRectangle(rect);
    page.getAnnotations().add(widget);
    
    // set the field value (this also generates the field's appearance)
    textBox.setValue("木兰辞");
    doc.save("ChineseOut.pdf");
    doc.close();
    

    This works fine. I also tested with the font you are using; unfortunately this produced an error, because MingLiU is a TrueType collection, which PDFBox could not handle at that point in time.
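
Since the question supplies the value through an XFDF file, it may also be worth ruling out the encoding of that data file. The following is only a minimal sketch using the plain JDK (no PDFBox). It writes an XFDF document explicitly as UTF-8 so that the bytes match the XML declaration. The class name `XfdfWriter`, its helper methods and the field name `Chinese` are assumptions made for illustration. `test.xfdf` is the file name from the command in the question. This does not address appearance generation, which is the more likely cause described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class XfdfWriter {

    // Escape the characters that are special in XML text and attribute values.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build an XFDF document holding one <field> element per map entry.
    static String buildXfdf(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\">\n");
        sb.append("  <fields>\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("    <field name=\"").append(escapeXml(e.getKey())).append("\">")
              .append("<value>").append(escapeXml(e.getValue())).append("</value>")
              .append("</field>\n");
        }
        sb.append("  </fields>\n");
        sb.append("</xfdf>\n");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Chinese", "木兰辞"); // hypothetical field name
        Path out = Paths.get("test.xfdf");
        // Encode explicitly as UTF-8 instead of relying on the platform default.
        Files.write(out, buildXfdf(fields).getBytes(StandardCharsets.UTF_8));
    }
}
```

The explicit `StandardCharsets.UTF_8` matters mainly on systems whose default charset is not UTF-8, such as many Windows setups, where characters outside the default code page could otherwise be written as `?`.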