Trying to embed simple UTF16 character into manually created PDF but failing


I'm trying to create a PDF document by hand (using the PDFGen C code on GitHub). This is on a small-footprint device with limited storage.

It all works fine until I want to embed (say) the Unicode ohm sign (U+2126).

Below is the test file I'm using, which should show "Hello" with an ohm sign after the 'H'.

However, it actually shows "H!&ello".

%PDF-1.4
<hex chars removed>
1 0 obj
<< /Pages 2 0 R /Type /Catalog >>
endobj
2 0 obj
<< /Count 1 /Kids [ 3 0 R ] /Type /Pages >>
endobj
3 0 obj
<< /Contents 4 0 R /MediaBox [ 0 0 500 800 ] /Parent 2 0 R /Resources 5 0 R /Type /Page >>
endobj
4 0 obj
<< /Length 57 >>
stream
BT /F1 24 Tf 175 720 Td <FEFF004821260065006C006C006F> Tj ET
endstream
endobj
5 0 obj
<< /Font << /F1 6 0 R >> >>
endobj
6 0 obj
<< /BaseFont /Courier /Subtype /Type1 /Type /Font >>
endobj
xref
0 7
0000000000 65535 f 
0000000015 00000 n 
0000000064 00000 n 
0000000123 00000 n 
0000000229 00000 n 
0000000335 00000 n 
0000000378 00000 n 
trailer << /Root 1 0 R /Size 7 /ID [<89311a609a751f1666063e6962e79bd5><89311a609a751f1666063e6962e79bd5>] >>
startxref
448
%%EOF

I can only assume my Unicode hex string <FEFF004821260065006C006C006F> is badly formatted.

Or is the font definition incorrect?

Or is my understanding of how to embed Unicode wrong?

Ultimately, I don't want to embed any fonts, as I have neither the storage space nor the processing power. I just want to add Unicode characters and rely on the PDF renderer to work out how to display them using the default Courier font.

Is that even possible?

Thanks in advance for any help/advice/comments.

UPDATE

After some useful advice below, I've now managed to achieve what I needed.

I modded my code to switch fonts on a per-character basis between Courier and Symbol and now support (nearly) all the standard characters.

I also added some character scaling to keep the Symbol characters aligned with the Courier font but the end result works for me :)

Here's an image of my test PDF: [image]
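
The post does not show how the character scaling was done. One way it could be done is with the PDF `Tz` operator, which sets horizontal text scaling as a percentage. The sketch below assumes a Symbol Omega width of 768 and a Courier cell of 600 (both in 1/1000 em); the 768 figure is inferred from the 14.4 to 32.832 step in the answer's example, and `format_tz` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Format a Tz (horizontal scaling) operator that squeezes a glyph of
   glyph_width into cell_width; both are given in 1/1000 em. The
   percentage is kept in thousandths so no floating point is needed.
   Returns the number of characters written, or -1 if out is too small. */
int format_tz(int glyph_width, int cell_width, char *out, size_t cap)
{
    assert(glyph_width > 0 && cell_width > 0 && out != NULL && cap > 0);

    long milli_percent = (long)cell_width * 100000L / glyph_width;
    int len = snprintf(out, cap, "%ld.%03ld Tz",
                       milli_percent / 1000, milli_percent % 1000);

    return (len > 0 && (size_t)len < cap) ? len : -1;
}
```

With these assumed widths, `format_tz(768, 600, buf, sizeof buf)` writes `78.125 Tz`. The scaling stays in effect for later text, so a `100 Tz` would be needed before switching back to Courier.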


Solution

  • Oddly, the original IBM PC code page 437 included Ω (U+03A9, at code 234), but it did not make it into Courier. You could try encoding the few characters you need as an embedded, subsetted symbol font, quite possibly using 7-bit ASCII or 8-bit ANSI codes, but the overhead would be tremendous for so few characters.

    A simpler approach is to switch to the Symbol font for just the characters that need it.

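    P.S. The codes don't need to be 16-bit "word" pairs. A simple Type 1 font such as Courier or Symbol is addressed with one byte per character, so there are only 256 codes. That is why your UTF-16 string shows as "H!&ello": the bytes 21 and 26 of U+2126 are drawn as '!' and '&'.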

        << /BaseFont /Symbol /Subtype /Type1 /Type /Font >>
        BT /F2 24 Tf 175 720 Td <4857657C7C6F20766FC27C64> Tj ET
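
    To see the one-byte rule at work, here is a minimal C sketch that reads a hex string the way a simple font does. The function name `show_as_simple_font` is hypothetical, and the sketch assumes that any code outside printable ASCII has no glyph and is simply not drawn.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value; returns -1 for anything else. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Read the body of a PDF hex string (the text between '<' and '>') one
   byte at a time: every two hex digits become one character code.
   Printable ASCII codes are copied to out; all other codes are skipped
   (assumption: the font has no glyph for them). Returns the number of
   character codes read. */
size_t show_as_simple_font(const char *hex, char *out, size_t cap)
{
    size_t codes = 0, used = 0;

    assert(hex != NULL && out != NULL && cap > 0);
    size_t len = strlen(hex);

    for (size_t i = 0; i + 1 < len; i += 2) {
        int hi = hex_value(hex[i]);
        int lo = hex_value(hex[i + 1]);
        assert(hi >= 0 && lo >= 0);   /* sketch: no whitespace handling */

        int code = hi * 16 + lo;
        if (code >= 0x20 && code < 0x7F && used + 1 < cap)
            out[used++] = (char)code;
        codes++;
    }
    out[used] = '\0';
    return codes;
}
```

    Feeding it the string from the question, `FEFF004821260065006C006C006F`, yields 14 codes and leaves `H!&ello` in the output buffer, which matches what the viewer displayed.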
    

    By alternating between Courier and Symbol you will get your desired result.

    In your code it could look something like this (with the text matrix transforms included):

    BT
    /F0 24 Tf 1 0 0 1 0 .0675 Tm (H) Tj
    ET
    BT
    /F1 24 Tf 1 0 0 1 14.4 .0675 Tm <003a> Tj
    ET
    BT
    /F0 24 Tf 1 0 0 1 32.832 .0675 Tm (ello) Tj
    ET
    

    Note that my editor used F0 for Courier and F1 for Symbol (numbering from 0 is more usual). It also used a slightly different way of encoding Omega, as <003a>.

    Here I am tweaking the text in Windows Notepad and watching the previewer update each time I save (Ctrl+S), so I can see the Omega slide sideways as its spacing changes. Also note that upper-case Omega is W in the raw Symbol font!
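
    The alternating layout can also be generated rather than typed. The following is only a sketch under stated assumptions: /F1 is Courier and /F2 is Symbol (as in the replacement file further down), Courier glyphs are 600 units wide, and the Symbol Omega is 768 units wide (inferred from the 14.4 to 32.832 step above). `emit_mixed_text` and `symbol_code` are hypothetical names, only the ohm sign and Omega are mapped, and one BT/ET pair is written per character to keep the logic short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Glyph widths in 1/1000 em (assumed values, see the text above). */
enum { COURIER_WIDTH = 600, SYMBOL_OMEGA_WIDTH = 768 };

/* Hypothetical mapping: returns the Symbol-font code for a code point,
   or 0 if the character should be drawn with Courier instead. */
static char symbol_code(uint32_t cp)
{
    return (cp == 0x2126 || cp == 0x03A9) ? 'W' : 0;
}

/* Append one "BT ... ET" line per character to out, choosing the font
   per character and advancing x by the glyph width. x is tracked in
   thousandths of a point so the arithmetic stays in integers.
   Returns the number of bytes written, or 0 if out is too small. */
size_t emit_mixed_text(const uint32_t *cps, size_t n, int font_size,
                       long x_milli, int y, char *out, size_t cap)
{
    size_t used = 0;

    assert(cps != NULL && out != NULL && cap > 0);
    assert(font_size > 0 && x_milli >= 0);
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        char sym = symbol_code(cps[i]);
        /* Only printable 7-bit ASCII is handled on the Courier side. */
        assert(sym != 0 || (cps[i] >= 0x20 && cps[i] < 0x7F));

        unsigned code = sym ? (unsigned)(unsigned char)sym : (unsigned)cps[i];
        int width = sym ? SYMBOL_OMEGA_WIDTH : COURIER_WIDTH;

        /* A hex string avoids having to escape '(', ')' and '\'. */
        int len = snprintf(out + used, cap - used,
                           "BT /%s %d Tf 1 0 0 1 %ld.%03ld %d Tm <%02X> Tj ET\n",
                           sym ? "F2" : "F1", font_size,
                           x_milli / 1000, x_milli % 1000, y, code);
        if (len < 0 || (size_t)len >= cap - used)
            return 0;               /* buffer too small */

        used += (size_t)len;
        x_milli += (long)font_size * width;
    }
    return used;
}
```

    For the code points 'H', U+2126, 'e' at 24 pt starting from x = 0, this places the three characters at 0.000, 14.400 and 32.832, the same offsets as in the hand-written example. Grouping consecutive Courier characters into one string would make the output leaner.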

    So my replacement for your file looks like this. (You can easily make it look closer to yours, and leaner, by removing white space and line feeds, as long as /Length and the xref offsets are updated to match.)

    %PDF-1.4
    %µ¶
    
    1 0 obj
    <<
      /Pages 2 0 R
      /Type /Catalog
    >>
    endobj
    
    2 0 obj
    <<
      /Count 1
      /Kids [ 3 0 R ]
      /Type /Pages
    >>
    endobj
    
    3 0 obj
    <<
      /Contents 4 0 R
      /MediaBox [ 0 0 500 800 ]
      /Parent 2 0 R
      /Resources <<
        /Font <<
          /F1 5 0 R
          /F2 6 0 R
        >>
      >>
      /Type /Page
    >>
    endobj
    
    4 0 obj
    <<
      /Length 133
    >>
    stream
    q
    BT
    /F1 24 Tf
    1 0 0 1 175 720 Tm
    (H) Tj
    ET
    BT
    /F2 24 Tf
    1 0 0 1 189 720 Tm
    (W) Tj
    ET
    BT
    /F1 24 Tf
    1 0 0 1 206 720 Tm
    (ello) Tj
    ET
    Q
    
    endstream
    endobj
    
    5 0 obj
    <<
      /BaseFont /Courier
      /Subtype /Type1
      /Type /Font
    >>
    endobj
    
    6 0 obj
    <<
      /BaseFont /Symbol
      /Subtype /Type1
      /Type /Font
    >>
    endobj
    
    xref
    0 7
    0000000000 65535 f 
    0000000016 00000 n 
    0000000070 00000 n 
    0000000136 00000 n 
    0000000307 00000 n 
    0000000494 00000 n 
    0000000569 00000 n 
    
    trailer
    <<
      /Size 7
      /Root 1 0 R
      /ID [ <89311A609A751F1666063E6962E79BD5> <EE408A115072E92E3A34C8BB8BDC6AE6> ]
    >>
    startxref
    643
    %%EOF
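
    If the file is written from C rather than edited by hand, the xref table is the part most sensitive to byte counts. Each entry is a fixed 20-byte record: a 10-digit byte offset, a 5-digit generation number, `n` or `f`, and a two-byte end-of-line. Entry 0 is the free-list head and carries generation 65535. The helper below is a minimal sketch of that layout; `format_xref_entry` is a hypothetical name, and the caller is assumed to have recorded each object's offset while writing it.

```c
#include <assert.h>
#include <stdio.h>

/* Write one 20-byte cross-reference entry into out (plus a terminating
   NUL): 10-digit byte offset, space, 5-digit generation, space,
   'n' (in use) or 'f' (free), then space + LF as the two-byte
   end-of-line. Returns 0 on success, -1 if the offset does not fit in
   10 digits. */
int format_xref_entry(unsigned long offset, unsigned gen, int in_use,
                      char out[21])
{
    assert(out != NULL && gen <= 65535u);

    int len = snprintf(out, 21, "%010lu %05u %c \n",
                       offset, gen, in_use ? 'n' : 'f');

    return (len == 20) ? 0 : -1;
}
```

    Writing every entry through one function like this keeps the record length constant, so the only values left to track are the object offsets and the position of the `xref` keyword for `startxref`.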