Unicode characters in document info dictionary keys

How do I create document info dictionary keys containing Unicode characters (typically Swedish characters, for instance C3A4 U+00E4 ä)? I would like to use the PdfStamper to enter my own metadata in the document info dictionary, but I can't get it to accept the Swedish characters.

Entering custom metadata using Acrobat works fine, and looking at the PDF in a text editor I can see that the characters get encoded as, for instance, #C3#A4 for the character mentioned above. So is there a way to achieve this programmatically using iText PdfStamper?

regards Mattias

PS. There is no problem having Unicode characters in the info dictionary values, but the keys are a different story.

Solution

Please take a look at the NameObject example, and give it a try. You'll see that iText automatically escapes special characters in names.

iText follows the ISO 32000-1 specification, which states (7.3.5, Name Objects):

Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). Uniquely defined means that any two name objects made up of the same sequence of characters denote the same object. Atomic means that a name has no internal structure; although it is defined by a sequence of characters, those characters are not considered elements of the name.

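When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules: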

a) A NUMBER SIGN (23h) (#) in a name shall be written by using its 2-digit hexadecimal code (23), preceded by the NUMBER SIGN.

b) Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.

c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.

NOTE 1: There is not a unique encoding of names into the PDF file because regular characters may be coded in either of two ways.

White space used as part of a name shall always be coded using the 2-digit hexadecimal notation and no white space may intervene between the SOLIDUS and the encoded name.

Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation.

The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters.

NOTE 2 The examples shown in Table 4 and containing # are not valid literal names in PDF 1.0 or 1.1.
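The escaping rules quoted above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not iText's actual implementation: the class `PdfNameEscapeSketch` and the method `escapeName` are hypothetical names, the delimiter list is taken from ISO 32000-1, 7.2.2, and the sketch does not reject the forbidden null character.

```java
import java.nio.charset.StandardCharsets;

class PdfNameEscapeSketch {

    // Delimiter characters (ISO 32000-1, 7.2.2); they are not regular characters.
    private static final String DELIMITERS = "()<>[]{}/%";

    /** Hypothetical helper: writes the 8-bit values of a name in escaped form. */
    static String escapeName(byte[] name) {
        StringBuilder out = new StringBuilder("/"); // the SOLIDUS introduces the name
        for (byte b : name) {
            int c = b & 0xff; // treat each byte as an unsigned 8-bit value
            boolean inPrintableRange = c >= 0x21 && c <= 0x7e;
            if (inPrintableRange && c != '#' && DELIMITERS.indexOf(c) < 0) {
                out.append((char) c); // rule b: regular character written as itself
            } else {
                // rules a and c, plus white space and codes outside 21h..7Eh
                out.append('#').append(String.format("%02x", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] key = "Special Character: \u00e4".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(escapeName(key));
    }
}
```

Because every byte is handled on its own, a character only fits into a name if it can be expressed as one or more 8-bit values first.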

I'm not copy/pasting table 4, but I don't see any example that uses characters that consist of two bytes. Can you share a PDF that contains a name with a two-byte character that behaves in the way you desire? The PDF specification explicitly says that characters in the context of names are 8-bit values. You seem to be talking about 16-bit values...

Additional note: in the current implementation of iText, we only look at 8 bits:

c = (char)(chars[k] & 0xff);

We deliberately throw away all the higher bits when characters with more than 8 bits are passed.
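To see what that masking does to the two characters discussed here, consider the following stand-alone sketch. It does not call iText; the class `LowByteMaskSketch` and the method `lowBytes` are hypothetical names that only mirror the `& 0xff` expression shown above.

```java
class LowByteMaskSketch {

    /** Keeps only the low 8 bits of each char, mirroring the masking shown above. */
    static int[] lowBytes(String s) {
        int[] result = new int[s.length()];
        for (int k = 0; k < s.length(); k++) {
            result[k] = s.charAt(k) & 0xff; // higher bits are discarded
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00E4 fits in 8 bits and survives unchanged.
        System.out.printf("%02x%n", lowBytes("\u00e4")[0]);
        // U+C3A4 is a single 16-bit char; only its low byte remains.
        System.out.printf("%02x%n", lowBytes("\uc3a4")[0]);
    }
}
```

The first call yields e4 and the second a4, so a single 16-bit character U+C3A4 can never produce the byte pair C3 A4 in a name.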

Actually, I think I have answered your question. Initially, I thought you were asking to add this character: http://www.fileformat.info/info/unicode/char/c3a4/index.htm

As it turns out, you only need "\u00e4" (ä). I've made a small code sample that demonstrates how one would add a custom entry containing this character to the document info dictionary: ChangeInfoDictionary.

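import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    // Start from the existing info dictionary so that no entries are lost.
    Map<String, String> info = reader.getInfo();
    // Custom entry: the key and the value both contain U+00E4.
    info.put("Special Character: \u00e4", "\u00e4");
    stamper.setMoreInfo(info);
    stamper.close();
    reader.close();
}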

Granted, when you open the PDF in a PDF viewer, you don't necessarily see "Special Character: ä" as the key, but that's a problem of the PDF viewer. When you open the PDF in a text editor, you clearly see:

/Special#20Character:#20#e4(ä)

This means that iText has correctly escaped the special character.

However, as you pointed out in your comment, the character doesn't show up in Adobe Reader. Based on a PDF I created using Acrobat, I found a workaround by using the following code:

// Build the key from the two UTF-8 bytes of U+00E4, one 8-bit char per byte.
StringBuilder buf = new StringBuilder();
buf.append((char) 0xc3);
buf.append((char) 0xa4);
info.put(buf.toString(), "\u00e4");

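Now the character is shown correctly. In other words, it's a matter of encoding: the key has to contain the two UTF-8 bytes of ä (C3 A4), each passed as a separate 8-bit character, which is also what Acrobat writes as #C3#A4.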
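The workaround above appends the bytes by hand. A more general variant could derive them from the intended key, as in the minimal sketch below. It assumes that the viewer decodes the bytes of a key name as UTF-8, which is what the Acrobat output suggests but is not verified here for other viewers; the class `Utf8NameKeySketch` and the method `toUtf8ByteChars` are hypothetical names and are not part of iText.

```java
import java.nio.charset.StandardCharsets;

class Utf8NameKeySketch {

    /** Hypothetical helper: returns one char per UTF-8 byte of the intended key. */
    static String toUtf8ByteChars(String key) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        // ISO-8859-1 maps every byte 0x00..0xff to the char with the same code,
        // so each resulting char survives a later "& 0xff" unchanged.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String key = toUtf8ByteChars("\u00e4");
        for (char c : key.toCharArray()) {
            System.out.printf("%02x ", (int) c);
        }
        System.out.println();
    }
}
```

With such a helper, the call in the workaround would become something like info.put(Utf8NameKeySketch.toUtf8ByteChars("\u00e4"), "\u00e4"); plain ASCII keys pass through unchanged because their UTF-8 bytes equal their character codes.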