unicode utf-8 ghostscript postscript truetype

How to implement Unicode (UTF-8) support for a CID-keyed font (Adobe's Type 0 Composite font) converted from ttf?

This post is a sequel to "Conversion from ttf to type 2 CID font (type 42 base font)". A CID-keyed font whose CIDMap enforces identity mapping (i.e. glyph index = CID) is futile unless it offers Unicode support in some way. How, then, can application software provide UTF-8 support for such a CID-keyed font externally?

Note: The application program that uses the CID-keyed font can be written in C, C++, Postscript or any other language.

Solution

The CID-keyed font NotoSansTamil-Regular.t42 has been converted from Google's Tamil ttf font. The conversion is needed because a PostScript program cannot access a TrueType font directly. Refer to the post "Conversion from ttf to type 2 CID font (type 42 base font)" for the conversion.

The CIDMap of t42 font enforces an identity mapping as follows:

Character code 0 maps to Glyph index 0
Character code 1 maps to Glyph index 1
Character code 2 maps to Glyph index 2
......
......
Character code NumGlyphs-1 maps to Glyph index NumGlyphs-1

Clearly, no Unicode is involved in this mapping. To see this concretely, consider the following PostScript program tamil.ps, which accesses the t42 font through PostScript's findfont operator.

%!PS-Adobe-3.0
/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def
13 myNoTo
100 600 moveto 
% தமிழ் தங்களை வரவேற்கிறது!
<0019001d002a005e00030019004e00120030002200030024001f002f0024005b0012002a0020007a00aa> show
100 550 moveto 
% Tamil Welcomes You!
<0155017201aa019801a500030163018801a5017f01b101aa018801c20003016901b101cb00aa00b5> show
showpage

Issue the following Ghostscript command to execute the postscript program tamil.ps.

gswin64c.exe "D:\cidfonts\NotoSansTamil-Regular.t42" "D:\cidfonts\tamil.ps" (on Windows Platform).
gs ~/cidfonts/NotoSansTamil-Regular.t42 ~/cidfonts/tamil.ps (on Linux Platform).

This will display the two strings தமிழ் தங்களை வரவேற்கிறது! and Tamil Welcomes You! on successive lines.

Note that the strings passed to the show operator are in hexadecimal format, enclosed in angle brackets. The show operator consumes 2 bytes at a time and maps this CID (a 16-bit value) to a glyph. For example, the first 4 hex digits of the 1st string are 0019, whose decimal equivalent is 25. This maps to the glyph த.
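To make the 2-bytes-per-CID reading concrete, here is a minimal C++ sketch that splits the body of such a hex string (without the enclosing angle brackets) into 16-bit CIDs. The function name parseCidHex is hypothetical and is not part of the UTF8Map program; it only illustrates how show groups the digits, assuming exactly 4 hex digits per CID.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from UTF8Map): split the body of a PostScript
// hex string into 16-bit CIDs, taking 4 hex digits (2 bytes) per CID.
std::vector<std::uint16_t> parseCidHex(const std::string& hex)
{
    if (hex.size() % 4 != 0)
        throw std::invalid_argument("expected 4 hex digits per CID");

    std::vector<std::uint16_t> cids;
    for (std::size_t i = 0; i < hex.size(); i += 4) {
        // Base-16 conversion: "0019" becomes 25.
        unsigned long value = std::stoul(hex.substr(i, 4), nullptr, 16);
        cids.push_back(static_cast<std::uint16_t>(value));
    }
    return cids;
}
```

Calling parseCidHex("0019001d002a") on the first three CIDs of the Tamil string yields 25, 29 and 42, the glyph indices that the identity CIDMap selects.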

To use this t42 font as it stands, every string (built from the character set of the ttf) would have to be converted into a hexadecimal string by hand. That is practically impossible, so the font on its own is futile.

Now consider the following C++ code that generates a postscript program called myNotoTamil.ps that accesses the same t42 font through postscript's findfont operator.

const short lcCharCodeBufSize = 200;    // Character Code buffer size.
char bufCharCode[lcCharCodeBufSize];    // Character Code buffer
FILE *fps = fopen ("D:\\cidfonts\\myNotoTamil.ps", "w");

fprintf (fps, "%%!PS-Adobe-3.0\n");
fprintf (fps, "/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def\n");
fprintf (fps, "13 myNoTo\n");
fprintf (fps, "100 600 moveto\n");
fprintf (fps, u8"%% தமிழ் தங்களை வரவேற்கிறது!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"தமிழ் தங்களை வரவேற்கிறது!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "%% Tamil Welcomes You!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"Tamil Welcomes You!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "showpage\n");
fclose (fps);

Although the contents of tamil.ps and myNotoTamil.ps are identical, the way the two files are produced differs like heaven and earth! Unlike tamil.ps, whose hexadecimal strings were made by hand, myNotoTamil.ps is generated by a C++ program that uses UTF-8 encoded strings directly and hides the hex strings completely. The function strps produces hex strings from UTF-8 encoded strings, and these are identical to the strings in tamil.ps.

The futile t42 font has suddenly become fruitful thanks to the strps function's ability to map UTF-8 to CIDs (every 2 bytes of a hex string form one CID)!

The strps function consults a mapping table aNotoSansTamilMap (implemented as a one-dimensional array constructed with the help of Unicode blocks) in order to map Unicode code points (extracted from the UTF-8 encoded string) to Character Identifiers (CIDs). The buffer bufCharCode (the 4th parameter of strps) returns the hex string corresponding to the UTF-8 encoded string, which is then passed to PostScript's show operator.
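The idea behind such a mapping can be sketched in a few lines of C++. This is not the actual strps implementation: the names kSampleMap, nextCodePoint and utf8ToCidHex are hypothetical, the table holds only four entries read off the hex strings in tamil.ps, and each code point is mapped on its own. The strings above show that the real function must also deal with combined glyphs (ழ் is the single CID 005e) and reordered vowel signs, which this sketch ignores.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical excerpt of a code point -> CID table. The four entries are
// inferred from the hex strings in tamil.ps; the real table is much larger.
static const std::map<char32_t, std::uint16_t> kSampleMap = {
    {0x0020, 0x0003},  // space
    {0x0B95, 0x0012},  // TAMIL LETTER KA
    {0x0BA4, 0x0019},  // TAMIL LETTER TA
    {0x0BAE, 0x001D},  // TAMIL LETTER MA
};

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Minimal decoder: assumes well-formed input and performs no validation.
char32_t nextCodePoint(const std::string& s, std::size_t& i)
{
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    int extra = 0;
    char32_t cp = lead;                                   // 1-byte (ASCII) case
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Map each code point to a CID and append it as 4 hex digits.
// Unmapped code points fall back to CID 0 (glyph 0, conventionally .notdef).
std::string utf8ToCidHex(const std::string& utf8)
{
    std::string hex;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = nextCodePoint(utf8, i);
        auto it = kSampleMap.find(cp);
        std::uint16_t cid = (it != kSampleMap.end()) ? it->second : 0;
        char buf[5];
        std::snprintf(buf, sizeof buf, "%04x", static_cast<unsigned>(cid));
        hex += buf;
    }
    return hex;
}
```

With this sample table, utf8ToCidHex("\xE0\xAE\xA4\xE0\xAE\xAE") (the UTF-8 bytes of த followed by ம) returns "0019001d", matching the first two CIDs of the hand-made string in tamil.ps.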

To benefit others, I have released this UTF8Map program on GitHub for the following platforms.

Windows 10 Platform (Github Public Repository for UTF8Map Program on Windows 10)

Open up DOS command line and issue the following clone command to download source code:
```
git clone https://github.com/marmayogi/UTF8Map-Win
```
Or execute the following curl command to download source code release in zip form:
```
curl -o UTF8Map-Win-2.0.zip -L https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
```
Or execute the following wget command to download source code release in zip form:
```
wget -O UTF8Map-Win-2.0.zip https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
```
Linux Platform (Github Public Repository for UTF8Map Program on Linux)

Issue the following clone command to download source code:
```
git clone https://github.com/marmayogi/UTF8Map-Linux
```
Or execute the following curl command to download source code release in tar form:
```
curl -o UTF8Map-Linux-1.0.tar.gz -L https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
```
Or execute the following wget command to download source code release in tar form:
```
wget -O UTF8Map-Linux-1.0.tar.gz https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
```

Note:

This program uses a t42 file to generate a ps file (a PostScript program) which displays the following on a single page:
1. A welcome message in Tamil and English.
2. List of Vowels (12 + 1 Glyphs). All of them are associated with Unicode code points.
3. List of Consonants (18 + 6 = 24 Glyphs). None of them is associated with a Unicode code point.
4. List of combined glyphs (Combination of Vowels + Consonants) in 24 lines. Each line displays 12 glyphs. Out of 288 Glyphs, 24 are associated with Unicode code points and the rest are not.
5. List of Numbers in two lines. All 13 Glyphs for Tamil numbers are associated with Unicode code points.
6. A footnote.
The two program files (main.cpp and mapunicode.h) are 100% portable, i.e. their contents are identical across platforms.
The two mapping tables aNotoSansTamilMap and aLathaTamilMap are given in mapunicode.h file.
A README Document in Markdown format has been included with the release.
This software has been tested for t42 fonts converted from the following ttf files.
1. Google's Noto Tamil ttf
2. Microsoft's Latha Tamil ttf