Embedded OpenType (CFF) font in a PDF shows strange behaviour in some viewers

When embedding a subsetted OpenType font with CFF outlines (Noto Sans HK) in a PDF generated by my own library, I am seeing some rather strange behaviour. The PDF shows no glyphs (un-selectable blanks) in Mac Preview and a random assortment of .notdef glyphs and spaces in Adobe Reader, with no errors reported in either.

But here's the deal: it renders perfectly with Poppler in a Docker image with no fonts installed (I have completely removed every pre-installed font so there would be no silent substitutions) and Chrome on my Mac (without the font being installed).

Furthermore, I have compared the rendering of my PDF in Chrome to that of a reference PDF created with Cairo using the same font. As shown below, overlaying my PDF on the Cairo one at 50% opacity shows they are identical.

Chrome rendering (Noto HK top, PingFang HK bottom):

Preview rendering (Noto HK invisible, PingFang HK as expected):

Other HK Chinese CFF fonts like PingFang HK render perfectly in every PDF reader I have tested, but Noto Sans HK just won't. As far as embedding restrictions go, FontBook shows Noto Sans HK as having "no restrictions", so nothing there either.

I am embedding all fonts as CIDFontType0C fonts with Identity-H encoding, and although I'm not providing ToUnicode maps yet as they are the next thing on the roadmap, that should make no difference to rendering.

Noto HK Font objects (Widths removed for conciseness):

6 0 obj
<< /Ascent 1160 /CapHeight 733 /Descent -288 /Flags 4 /FontBBox [ -991 -1050 2930 1810 ] /FontFile3 10 0 R /FontName /NZGUSD+NotoSansHK-Thin /ItalicAngle 0 /StemV 58 /Type /FontDescriptor >>
endobj
7 0 obj
<< /BaseFont /NZGUSD+NotoSansHK-Thin /DescendantFonts [ 8 0 R ] /Encoding /Identity-H /Subtype /Type0 /Type /Font >>
endobj
8 0 obj
<< /BaseFont /NZGUSD+NotoSansHK-Thin /CIDSystemInfo << /Ordering (Identity) /Registry (Adobe) /Supplement 0 >> /FontDescriptor 6 0 R /Subtype /CIDFontType0 /Type /Font /W 9 0 R >>
endobj

Equivalent PingFang objects:

11 0 obj
<< /Ascent 1060 /CapHeight 860 /Descent -340 /Flags 4 /FontBBox [ -72 -212 1126 952 ] /FontFile3 15 0 R /FontName /DYBBAB+PingFangHK-Regular /ItalicAngle 0 /StemV 95 /Type /FontDescriptor >>
endobj
12 0 obj
<< /BaseFont /DYBBAB+PingFangHK-Regular /DescendantFonts [ 13 0 R ] /Encoding /Identity-H /Subtype /Type0 /Type /Font >>
endobj
13 0 obj
<< /BaseFont /DYBBAB+PingFangHK-Regular /CIDSystemInfo << /Ordering (Identity) /Registry (Adobe) /Supplement 0 >> /FontDescriptor 11 0 R /Subtype /CIDFontType0 /Type /Font /W 14 0 R >>
endobj

Relevant Page objects:

3 0 obj
<< /F4v0 12 0 R /F5v0 7 0 R >>
endobj
4 0 obj
<< /Contents 5 0 R /CropBox [ 2.5 4 595 842 ] /MediaBox [ 0 0 600 850 ] /Parent 2 0 R /Resources << /Font 3 0 R >> /Type /Page >>
endobj
5 0 obj
<< /Length 462 >>
stream
q 1 1 1 rg 0 0 600 850 re F Q  BT /F5v0 15.000000 Tf 0 0 0 rg 0 Tr 27.500000 802.000000 Td [<0AFD292728192FFF3162282746BB112F14E410E20E96201D0D820A9111440EC016922CB046A10AFD0EC039AF1D0B272D17D431C92A2B4F4D384719160F2C29C9297634F34F4D1846>] TJ ET  BT /F4v0 15.000000 Tf 0 0 0 rg 0 Tr 27.500000 780.280000 Td [<05487DE1129E161216D412A7726A08C175A77465074A7A1706A504E4748207710B1814B5726605480771641D0E4D12580BD481D113A37267628146D107BE7E0D1358AD3772670C18>] TJ ET endstream
endobj

I'm using HarfBuzz to generate subsets with the HB_SUBSET_FLAGS_RETAIN_GIDS flag set, and when I view the generated subset in FontForge, the glyphs expected are present with the correct GIDs.

Minimal reproducible PDF (not linearised or compressed for readability)

Edit:

Some further investigation showed that embedding the same font as a CIDFontType2 font instead of CIDFontType0 makes Preview show the desired result, which is beyond bizarre to me. Adobe Reader still shows the .notdefs, and Poppler warns about using the wrong type (unsurprisingly) but still renders the PDF fine. My assumption is Preview and Poppler are interpreting the embedded font as CIDFontType0 correctly and ignoring the incorrect /Subtype I've provided.

The question still remains of why Preview would correctly display the font when it's embedded incorrectly, but not otherwise.

Edit 2:

When the font is embedded whole, the result is mostly the same, although now rather than seeing nothing I get a few random characters instead:

In Chrome the result is the same as before:

The glyphs being rendered definitely do not correspond to the glyph IDs being provided (again, verified with FontForge).

As before, PingFang and other fonts render perfectly in either case.

I'm starting to think I might be missing an edge case with respect to glyph indexing: Cairo and other PDF generators remap GIDs to low numbers and so have no issues, whereas I'm retaining the original GIDs. They still fit in 2 bytes, but this could be an implementation limitation I haven't seen.

I'll try remapping the GIDs to see if that helps and report back.

Solution

This is happening because of a misunderstanding on my part of how CID fonts work in PDFs.

Let me explain.

When using a font in PDF you will provide several structures (font descriptor, font dictionary, and for Type0 a descendant font) describing the font, and categorising it into one of the predefined types (Type0, Type1, Type3, or TrueType), and in the case of Type0 a subtype (/CIDFontType0 or /CIDFontType2).

What I didn't understand was that Type0 fonts with subtype /CIDFontType0 have one further implicit distinction: those whose CFF data uses CIDFont operators in its Top DICT structure (CID-keyed fonts), and those that don't (which includes all CFF2 fonts).

The way glyph lookup works also differs with the type of font used. With "Simple" fonts (Type1, TrueType) you use the string itself, one byte per character code ((like this) or <74686973>), as the operand to text-showing operators, whereas for Composite fonts (Type0) you typically use hex-encoded strings of CIDs, two bytes each with Identity-H (<DEADBEEF...>).
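To make the difference concrete, here is a minimal Python sketch of how the two kinds of operand could be built. The function names are my own, and the simple-font case assumes the font's character codes happen to match Latin-1 code points, which depends on the font's /Encoding.

```python
def simple_font_operand(text: str) -> str:
    """Hex operand for a simple font: one byte per character code.

    Assumption: character codes equal Latin-1 code points.
    """
    codes = text.encode("latin-1")
    return "<" + codes.hex().upper() + ">"


def composite_font_operand(cids: list[int]) -> str:
    """Hex operand for a Type0 font with Identity-H: two big-endian bytes per CID."""
    for cid in cids:
        if not 0 <= cid <= 0xFFFF:
            raise ValueError(f"CID {cid} does not fit in two bytes")
    return "<" + "".join(f"{cid:04X}" for cid in cids) + ">"


print(simple_font_operand("this"))     # <74686973>
print(composite_font_operand([2813]))  # <0AFD>
```

Note that 2813 encodes to 0AFD, which is the first code in the Noto string of the content stream shown in the question.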

When using Identity mappings with CID fonts, CID == GID, so we can use GIDs directly in these strings, unless you're using a CID font with CFF outlines that has CIDFont operators in its Top DICT. In this (now rather rare) case, CIDs may or may not equal GIDs. In my testing NotoSansHK was the only font that used a different mapping, which is why the other fonts worked fine.
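One way to tell the two cases apart is to check whether the Top DICT contains the ROS operator (bytes 12 30), which is the first of the CIDFont operators. The sketch below is a rough, hand-rolled scanner and assumes you have already extracted the raw Top DICT bytes from the CFF table (parsing the CFF header and INDEX structures is left out). The function names are hypothetical.

```python
def top_dict_operators(data: bytes) -> list[tuple[int, ...]]:
    """List the operators in a CFF DICT, skipping over the operands."""
    ops = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 == 12:                 # escape: two-byte operator
            ops.append((12, data[i + 1]))
            i += 2
        elif b0 <= 21:               # one-byte operator
            ops.append((b0,))
            i += 1
        elif b0 == 28:               # 16-bit integer operand
            i += 3
        elif b0 == 29:               # 32-bit integer operand
            i += 5
        elif b0 == 30:               # real number: nibbles until an 0xF nibble
            i += 1
            while True:
                b = data[i]
                i += 1
                if (b >> 4) == 0xF or (b & 0xF) == 0xF:
                    break
        elif 32 <= b0 <= 246:        # one-byte integer operand
            i += 1
        elif 247 <= b0 <= 254:       # two-byte integer operand
            i += 2
        else:
            raise ValueError(f"reserved DICT byte {b0} at offset {i}")
    return ops


def is_cid_keyed(top_dict_data: bytes) -> bool:
    """True if the Top DICT contains the ROS operator (12 30)."""
    return (12, 30) in top_dict_operators(top_dict_data)
```

If this returns True, the codes written with Identity-H need to be CIDs taken from the charset rather than raw GIDs.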

What I needed was to parse the charset array referenced from the Top DICT structure and look up the GID in question. In a name-keyed CFF font each charset entry is a SID, which identifies the string holding the glyph's name, but in a CID-keyed font (one with CIDFont operators in its Top DICT) the charset entries are the CIDs themselves. Once the CID is obtained, it can be used to encode text in the PDF.
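As an illustration of that lookup, here is a small sketch of a charset parser covering the three charset formats in the CFF specification. It assumes `data` starts at the charset offset given in the Top DICT and that `num_glyphs` comes from the CharStrings INDEX count; both are inputs you would have to obtain separately, and the function name is my own.

```python
import struct


def parse_cff_charset(data: bytes, num_glyphs: int) -> list[int]:
    """Return a list indexed by GID holding the CID (or SID in name-keyed fonts)."""
    fmt = data[0]
    values = [0]  # GID 0 is always .notdef and is not stored in the charset
    pos = 1
    if fmt == 0:
        # Format 0: one 16-bit value per glyph
        for _ in range(num_glyphs - 1):
            (value,) = struct.unpack_from(">H", data, pos)
            values.append(value)
            pos += 2
    elif fmt in (1, 2):
        # Formats 1 and 2: ranges of (first, nLeft); nLeft is 8 or 16 bits wide
        range_fmt, size = (">HB", 3) if fmt == 1 else (">HH", 4)
        while len(values) < num_glyphs:
            first, n_left = struct.unpack_from(range_fmt, data, pos)
            pos += size
            values.extend(range(first, first + n_left + 1))
        del values[num_glyphs:]
    else:
        raise ValueError(f"unknown charset format {fmt}")
    return values


# Synthetic format 2 charset: CIDs 100..102, then CID 9749
charset = b"\x02" + struct.pack(">HH", 100, 2) + struct.pack(">HH", 9749, 0)
gid_to_cid = parse_cff_charset(charset, num_glyphs=5)
print(gid_to_cid[4])  # 9749
```

With the resulting list, `gid_to_cid[gid]` gives the value to write into the content stream for a given glyph.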

In my case, 人 (U+4EBA) had a GID of 2813, but the PDF reader interpreted that as a CID, which in this case didn't exist. When using the CID of 9749 instead, however, the glyph is shown as expected.
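For this glyph the fix amounts to rewriting the two-byte code in the text string. The sketch below remaps an existing Identity-H hex string from GIDs to CIDs, given a GID-to-CID mapping. The function name and the one-entry mapping are purely illustrative; a real mapping would come from the font's charset.

```python
def remap_hex_string(hex_operand: str, gid_to_cid: dict[int, int]) -> str:
    """Rewrite a two-byte-per-code hex string, replacing each GID with its CID."""
    digits = hex_operand.strip("<>")
    if len(digits) % 4:
        raise ValueError("expected a whole number of two-byte codes")
    gids = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    return "<" + "".join(f"{gid_to_cid[gid]:04X}" for gid in gids) + ">"


# Illustrative mapping for the single glyph discussed above
print(remap_hex_string("<0AFD>", {2813: 9749}))  # <2615>
```

Here 0AFD is GID 2813 in hexadecimal, and 2615 is CID 9749.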