
Unicode characters in document info dictionary keys


How do I create document info dictionary keys containing Unicode characters (typically Swedish characters, for instance ä, U+00E4, which is C3 A4 in UTF-8)? I would like to use PdfStamper to enter my own metadata in the document info dictionary, but I can't get it to accept the Swedish characters.

Entering custom metadata using Acrobat works fine, and looking at the PDF in a text editor I can see that the characters get encoded as, for instance, #C3#A4 for the character mentioned above. So is there a way to achieve this programmatically using iText PdfStamper?

regards Mattias

PS. There is no problem having Unicode characters in the info dictionary values, but the keys are a different story.


Solution

  • Please take a look at the NameObject example, and give it a try. You'll see that iText automatically escapes special characters in names.

    iText follows the ISO-32000-1 specification, which states (7.3.5, Name Objects):

    Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). Uniquely defined means that any two name objects made up of the same sequence of characters denote the same object. Atomic means that a name has no internal structure; although it is defined by a sequence of characters, those characters are not considered elements of the name.

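    When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules: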

    a) A NUMBER SIGN (23h) (#) in a name shall be written by using its 2-digit hexadecimal code (23), preceded by the NUMBER SIGN.

    b) Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.

    c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.

    NOTE 1: There is not a unique encoding of names into the PDF file because regular characters may be coded in either of two ways.

    White space used as part of a name shall always be coded using the 2-digit hexadecimal notation and no white space may intervene between the SOLIDUS and the encoded name.

    Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation.

    The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters.

    NOTE 2 The examples shown in Table 4 and containing # are not valid literal names in PDF 1.0 or 1.1.

    I'm not copy/pasting Table 4, but I don't see any example in it that uses characters consisting of two bytes. Can you share a PDF that contains a name with a two-byte character that behaves in the way you desire? The PDF specification explicitly says that characters in the context of names are 8-bit values. You seem to be talking about 16-bit values...

    Additional note: in the current implementation of iText, we only look at 8 bits:

    c = (char)(chars[k] & 0xff);
    

    We deliberately throw away all the higher bits when characters with more than 8 bits are passed.
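    To make the effect of that mask concrete, here is a minimal sketch in plain Java. The class and method names (`LowByteDemo`, `lowByte`) are made up for illustration and are not part of iText; only the `& 0xff` expression comes from the line quoted above.

    ```java
    public class LowByteDemo {
        /** Mirrors the quoted iText expression: keep only the low 8 bits of a char. */
        static int lowByte(char ch) {
            return ch & 0xff;
        }

        public static void main(String[] args) {
            // U+00E4 fits in 8 bits, so it survives the mask unchanged.
            System.out.println(Integer.toHexString(lowByte('\u00e4'))); // prints e4
            // U+C3A4 is a 16-bit value; its high byte is dropped and only a4 remains.
            System.out.println(Integer.toHexString(lowByte('\uc3a4'))); // prints a4
        }
    }
    ```

    So a single char above U+00FF cannot be used to smuggle two bytes into a name; each byte of the name has to be passed as its own char.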

    Actually, I think I have answered your question. Initially, I thought you were asking to add this character: http://www.fileformat.info/info/unicode/char/c3a4/index.htm

    As it turns out, you only need "\u00e4" (ä). I've made a small code sample that demonstrates how one would add a custom entry containing this character to the document info dictionary: ChangeInfoDictionary.

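    // Imports assume iText 5 (com.itextpdf packages).
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Map;
    import com.itextpdf.text.DocumentException;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;
    public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
        // Start from the existing info dictionary so current entries are kept.
        Map<String, String> info = reader.getInfo();
        info.put("Special Character: \u00e4", "\u00e4");
        stamper.setMoreInfo(info);
        stamper.close();
        reader.close();
    }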
    

    Granted, when you open the PDF in a PDF viewer, you don't necessarily see "Special Character: ä" as the key value, but that's a problem of the PDF viewer. When you open the PDF in a text editor, you clearly see:

    /Special#20Character:#20#e4(ä)
    

    Which means that iText has correctly escaped the special character.
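    The escaping seen in that output can be reproduced with a small standalone sketch of rules a) to c) from the specification. This is not iText's actual code: `PdfNameEscaper` and `escape` are hypothetical names, and the list of delimiter characters is taken from the specification's definition of non-regular characters rather than from anything shown in this answer.

    ```java
    public class PdfNameEscaper {
        // Delimiter characters, which the PDF specification does not count as regular.
        private static final String DELIMITERS = "()<>[]{}/%";

        /** Writes a name as a PDF token, treating every char as one 8-bit value. */
        public static String escape(String name) {
            StringBuilder out = new StringBuilder("/");
            for (int k = 0; k < name.length(); k++) {
                // Keep only the low 8 bits, like the iText line quoted earlier.
                int c = name.charAt(k) & 0xff;
                boolean printable = c >= 0x21 && c <= 0x7e;
                if (printable && c != '#' && DELIMITERS.indexOf(c) < 0) {
                    out.append((char) c); // regular character, written as itself
                } else {
                    // '#', white space, delimiters and non-ASCII bytes become #xx
                    out.append('#').append(String.format("%02x", c));
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // prints /Special#20Character:#20#e4
            System.out.println(escape("Special Character: \u00e4"));
        }
    }
    ```

    The space becomes #20 and ä becomes the single byte #e4, which matches what the text editor shows.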

    However: as you pointed out in your comment, the character doesn't show up in Adobe Reader. Based on a PDF I created using Acrobat, I found a workaround by using the following code:

    // Build the key from the two UTF-8 bytes of U+00E4 (C3 A4), one char per byte.
    StringBuilder buf = new StringBuilder();
    buf.append((char) 0xc3);
    buf.append((char) 0xa4);
    info.put(buf.toString(), "\u00e4");
    

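    Now the character is shown correctly. In other words: it's a matter of encoding. Adobe Reader displays the key correctly when the name carries the two UTF-8 bytes of ä (C3 A4), each passed as its own 8-bit character, instead of the single value E4.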
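    For keys longer than one character, the same workaround can be generalized with the standard library instead of appending bytes by hand. The following is a sketch under that assumption; `Utf8NameKey` and `toUtf8Chars` are hypothetical helper names, not iText API, and it has only been reasoned through for the ä case discussed here.

    ```java
    import java.nio.charset.StandardCharsets;

    public class Utf8NameKey {
        /** Re-expresses the UTF-8 bytes of a key as one char per byte. */
        public static String toUtf8Chars(String key) {
            byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
            // ISO-8859-1 maps each byte 0x00..0xFF to the char with the same value,
            // so every resulting char fits in the 8 bits that iText keeps.
            return new String(utf8, StandardCharsets.ISO_8859_1);
        }

        public static void main(String[] args) {
            String key = toUtf8Chars("\u00e4");
            // The single character U+00E4 turns into the two chars 0xC3 and 0xA4.
            System.out.println(key.length()); // prints 2
        }
    }
    ```

    With such a helper, the entry would be added as `info.put(Utf8NameKey.toUtf8Chars("Special Character: \u00e4"), "\u00e4");`. Pure ASCII keys pass through unchanged, because their UTF-8 bytes are identical to their character codes.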