
How to get Unicode Hex Values from a Type 1 text in a pdf file?


I am trying to write a PDF parser in C++. I am having trouble reading text written in languages that do not use the Latin alphabet.

For example, I have a text that is described as follows:

T1_0 257 0 R

/T1_0 1 Tf
40.2614 0 0 47.4187 120.4995 595.2451 Tm
[(\037\036)3(\035)21(\034)-8(\033)5(\032\031)]TJ

257 0 obj
<</BaseFont/HVTZBF+MyriadPro-Regular/Encoding 269 0 R/FirstChar 25/FontDescriptor 270 0 R/LastChar 31/Subtype/Type1/Type/Font/Widths[417 555 472 551 457 236 553]>>
endobj

269 0 obj
<</BaseEncoding/WinAnsiEncoding/Differences[25/uni03C2/eta/lambda/alpha/chi/iota/uni03BC]/Type/Encoding>>
endobj

I am not interested in the font details; what I really want is the symbols of this text in Unicode. The "Differences" array gives a name for each symbol of the text. The first and the last symbols are given as Unicode hex values, but the rest are described by their names from Adobe's "Symbol Set and Encoding" table.

e.g. "uni03C2" is "ς", "eta" is "η", "lambda" is "λ", etc.

How can I get the Unicode hexadecimal value for each of the symbols of my text?

p.s.: I have also tried to decode the FontFile3 program, but I cannot see its content, except for some information about the font's license.

p.s.2: Here is a link to the file.

Thanks in advance.


Solution

  • You can find the names in the "Adobe Glyph List".

    Names with the "uni" prefix can be translated by removing the prefix, which leaves the four-digit UTF-16 hex value of the character (for example, "uni03C2" is U+03C2). Could you share a link to this type of document?

    The full specification of the AGL is available here.
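
The lookup described above could be sketched in C++ roughly as follows. This is a minimal illustration, not a complete implementation: the names glyphNameToUnicode, decodeWithDifferences and kAglExcerpt are hypothetical, the table holds only the five AGL names that occur in the question's /Differences array (a real parser would load the full Adobe Glyph List), and the sketch leaves out the parts of the AGL algorithm that strip ".suffix" parts and split ligature names on underscores. Codes outside the /Differences run should fall back to the base encoding (WinAnsiEncoding here); the sketch simply emits U+FFFD for them.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Tiny excerpt of the Adobe Glyph List: only the names used in the question.
// Assumption: a real parser would load the complete AGL file instead.
static const std::map<std::string, char32_t> kAglExcerpt = {
    {"alpha", 0x03B1}, {"chi", 0x03C7}, {"eta", 0x03B7},
    {"iota", 0x03B9},  {"lambda", 0x03BB},
};

// True if s is non-empty and consists only of uppercase hex digits,
// which is what the AGL specification requires for "uni"/"u" names.
static bool isUpperHex(const std::string& s) {
    if (s.empty()) return false;
    for (char c : s)
        if (!((c >= '0' && c <= '9') || (c >= 'A' && c <= 'F'))) return false;
    return true;
}

// Convert an already validated hex string to a code point.
static char32_t parseHex(const std::string& hex) {
    assert(isUpperHex(hex));  // callers validate before calling
    return static_cast<char32_t>(std::stoul(hex, nullptr, 16));
}

static bool isSurrogate(char32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }

// Resolve one glyph name to Unicode code points.
// Returns an empty vector if the name cannot be resolved.
std::vector<char32_t> glyphNameToUnicode(const std::string& name) {
    // 1. Plain AGL name such as "eta" or "lambda".
    auto it = kAglExcerpt.find(name);
    if (it != kAglExcerpt.end()) return {it->second};

    // 2. "uni" followed by one or more groups of four uppercase hex digits.
    if (name.rfind("uni", 0) == 0) {
        const std::string hex = name.substr(3);
        if (hex.size() % 4 == 0 && isUpperHex(hex)) {
            std::vector<char32_t> out;
            for (std::size_t i = 0; i < hex.size(); i += 4) {
                const char32_t cp = parseHex(hex.substr(i, 4));
                if (isSurrogate(cp)) return {};
                out.push_back(cp);
            }
            return out;
        }
    }

    // 3. "u" followed by four to six uppercase hex digits.
    if (name.size() >= 5 && name.size() <= 7 && name[0] == 'u') {
        const std::string hex = name.substr(1);
        if (isUpperHex(hex)) {
            const char32_t cp = parseHex(hex);
            if (cp <= 0x10FFFF && !isSurrogate(cp)) return {cp};
        }
    }
    return {};
}

// Map the bytes of a decoded PDF string through a /Differences run that
// starts at firstCode. Unresolved codes become U+FFFD in this sketch.
std::u32string decodeWithDifferences(const std::string& codes, int firstCode,
                                     const std::vector<std::string>& names) {
    std::u32string out;
    for (unsigned char c : codes) {
        const int idx = static_cast<int>(c) - firstCode;
        if (idx < 0 || idx >= static_cast<int>(names.size())) {
            out.push_back(0xFFFD);
            continue;
        }
        const std::vector<char32_t> cps = glyphNameToUnicode(names[idx]);
        if (cps.empty()) out.push_back(0xFFFD);
        else out.append(cps.begin(), cps.end());
    }
    return out;
}
```

Applied to the question's data, the bytes \037\036\035\034\033\032\031 are the codes 31 down to 25. With the /Differences run starting at 25, decodeWithDifferences would yield U+03BC U+03B9 U+03C7 U+03B1 U+03BB U+03B7 U+03C2, i.e. "μιχαλης". The numbers between the strings in the TJ array are only kerning adjustments and do not contribute characters.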