This post is a sequel to Conversion from ttf to type 2 CID font (type 42 base font).
A CID-keyed font whose CIDMap enforces an identity mapping (i.e. glyph index = CID) is futile on its own, because it offers no inherent Unicode support. How, then, can application software provide UTF-8 support for such a CID-keyed font externally?
Note: The application program that uses the CID-keyed font can be written in C, C++, PostScript or any other language.
The CID-keyed font NotoSansTamil-Regular.t42 has been converted from Google's Tamil ttf font.
This conversion is needed because a PostScript program cannot access a TrueType font directly.
Refer to the post Conversion from ttf to type 2 CID font (type 42 base font) for the conversion.
The CIDMap of the t42 font enforces an identity mapping as follows:
Character code 0 maps to Glyph index 0
Character code 1 maps to Glyph index 1
Character code 2 maps to Glyph index 2
......
......
Character code NumGlyphs-1 maps to Glyph index NumGlyphs-1
Clearly, no Unicode is inherently involved in this mapping.
To understand this concretely, create the following PostScript program tamil.ps, which accesses the t42 font through PostScript's findfont operator.
%!PS-Adobe-3.0
/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def
13 myNoTo
100 600 moveto
% தமிழ் தங்களை வரவேற்கிறது!
<0019001d002a005e00030019004e00120030002200030024001f002f0024005b0012002a0020007a00aa> show
100 550 moveto
% Tamil Welcomes You!
<0155017201aa019801a500030163018801a5017f01b101aa018801c20003016901b101cb00aa00b5> show
showpage
Issue the following Ghostscript command to execute the PostScript program tamil.ps.
gswin64c.exe "D:\cidfonts\NotoSansTamil-Regular.t42" "D:\cidfonts\tamil.ps" (on Windows platform)
gs ~/cidfonts/NotoSansTamil-Regular.t42 ~/cidfonts/tamil.ps (on Linux platform)
This will display the two strings தமிழ் தங்களை வரவேற்கிறது! and Tamil Welcomes You! on successive rows.
Note that the strings passed to the show operator are in hexadecimal format, enclosed in angle brackets. The show operator extracts 2 bytes at a time and maps this CID (a 16-bit value) to a glyph.
For example, the first 4 hex digits of the first string are 0019, whose decimal equivalent is 25. This maps to the glyph த.
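The 2-bytes-per-CID reading described above can be sketched in C++ as follows. This is only an illustration of how a hex string breaks down into CIDs; the function names hexValue and hexToCids are made up for this sketch and are not part of Ghostscript or of the UTF8Map program. It assumes the hex string contains no white space and a multiple of 4 hex digits.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Convert one hexadecimal digit to its value (0..15).
// (Hypothetical helper, used only by this sketch.)
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hexadecimal digit");
}

// Split the body of a PostScript hex string (the text between < and >)
// into 16-bit CIDs: 4 hex digits (2 bytes, high byte first) per CID.
std::vector<std::uint16_t> hexToCids(const std::string& hex)
{
    if (hex.size() % 4 != 0)
        throw std::invalid_argument("expected a multiple of 4 hex digits");

    std::vector<std::uint16_t> cids;
    for (std::size_t i = 0; i < hex.size(); i += 4) {
        std::uint16_t cid = 0;
        for (std::size_t j = 0; j < 4; ++j)
            cid = static_cast<std::uint16_t>(cid * 16 + hexValue(hex[i + j]));
        cids.push_back(cid);
    }
    return cids;
}
```

Calling hexToCids("0019001d002a005e0003") returns five CIDs, the first of which is 25, the CID of the glyph த in the example above.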
In order to use this t42 font, each string (drawn from the character set of the ttf) has to be converted into a hexadecimal string by hand, which is practically impossible, and therefore the font on its own is futile.
Now consider the following C++ code, which generates a PostScript program called myNotoTamil.ps that accesses the same t42 font through PostScript's findfont operator.
const short lcCharCodeBufSize = 200; // Character Code buffer size.
char bufCharCode[lcCharCodeBufSize]; // Character Code buffer
FILE *fps = fopen ("D:\\cidfonts\\myNotoTamil.ps", "w");
fprintf (fps, "%%!PS-Adobe-3.0\n");
fprintf (fps, "/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def\n");
fprintf (fps, "13 myNoTo\n");
fprintf (fps, "100 600 moveto\n");
fprintf (fps, u8"%% தமிழ் தங்களை வரவேற்கிறது!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"தமிழ் தங்களை வரவேற்கிறது!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "%% Tamil Welcomes You!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"Tamil Welcomes You!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "showpage\n");
fclose (fps);
Although the contents of tamil.ps and myNotoTamil.ps are identical, the way the two ps files are produced is as different as heaven and earth!
Observe that, unlike tamil.ps (handmade hexadecimal strings), myNotoTamil.ps is generated by a C++ program that uses UTF-8 encoded strings directly and hides the hex strings completely. The function strps produces hex strings from UTF-8 encoded strings, and these are identical to the strings present in tamil.ps.
The futile t42 font has suddenly become fruitful thanks to the strps function's ability to map UTF-8 to CIDs (every 2 bytes of a hex string, i.e. 4 hex digits, represent one CID)!
The strps function consults a mapping table, aNotoSansTamilMap (implemented as a one-dimensional array constructed with the help of Unicode Blocks), in order to map Unicode code points (extracted from the UTF-8 encoded string) to Character Identifiers (CIDs).
The buffer bufCharCode passed to the strps function (4th parameter) receives the hex string corresponding to the UTF-8 encoded string, which is then handed to PostScript's show operator.
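To make the idea behind strps easier to follow, here is a minimal sketch of a UTF-8 to CID hex conversion. It is not the actual strps implementation, and the table type used here is not the real aNotoSansTamilMap: the names decodeUtf8, toCidHex and CidTable are hypothetical, the table is a plain hash map rather than an array built from Unicode Blocks, and unmapped code points are assumed to fall back to CID 0 (glyph index 0, the missing-glyph slot of a TrueType font). The decoder assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Assumes well-formed input; malformed sequences are not diagnosed.
std::vector<char32_t> decodeUtf8(const std::string& s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;                       // number of continuation bytes
        if (b < 0x80)      { cp = b;        extra = 0; }  // 1-byte (ASCII)
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  // 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  // 3-byte sequence
        else               { cp = b & 0x07; extra = 3; }  // 4-byte sequence
        ++i;
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Hypothetical lookup table for this sketch: code point -> CID.
using CidTable = std::unordered_map<char32_t, std::uint16_t>;

// Build the body of a PostScript hex string: 4 hex digits per CID.
// Assumption: code points missing from the table map to CID 0.
std::string toCidHex(const std::string& utf8, const CidTable& table)
{
    std::string hex;
    for (char32_t cp : decodeUtf8(utf8)) {
        auto it = table.find(cp);
        std::uint16_t cid = (it != table.end()) ? it->second : 0;
        char buf[5];
        std::snprintf(buf, sizeof buf, "%04x", static_cast<unsigned>(cid));
        hex += buf;
    }
    return hex;
}
```

With a table holding only U+0BA4 (த) mapped to 0x0019, as stated above, toCidHex("\xE0\xAE\xA4", table) yields "0019". The hex strings in tamil.ps appear to use single glyphs for some combinations of code points, so a one-to-one lookup like this is only a starting point; the real table in mapunicode.h may well do more.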
To benefit others, I have released this UTF8Map program on GitHub for the following platforms.
Windows 10 Platform (GitHub Public Repository for UTF8Map Program on Windows 10)
Open up the DOS command line and issue the following clone command to download the source code:
git clone https://github.com/marmayogi/UTF8Map-Win
Or execute the following curl command to download the source code release in zip form:
curl -o UTF8Map-Win-2.0.zip -L https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Or execute the following wget command to download the source code release in zip form:
wget -O UTF8Map-Win-2.0.zip https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Linux Platform (GitHub Public Repository for UTF8Map Program on Linux)
Issue the following clone command to download the source code:
git clone https://github.com/marmayogi/UTF8Map-Linux
Or execute the following curl command to download the source code release in tar form:
curl -o UTF8Map-Linux-1.0.tar.gz -L https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Or execute the following wget command to download the source code release in tar form:
wget -O UTF8Map-Linux-1.0.tar.gz https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Note:
This program uses the t42 file to generate a ps file (a PostScript program), which will display the following on a single page:
The two program files (main.cpp and mapunicode.h) are 100% portable, i.e. the contents of the two files are identical across platforms.
The two mapping tables aNotoSansTamilMap and aLathaTamilMap are given in the mapunicode.h file.
A README document in Markdown format has been included with the release.
This software has been tested with t42 fonts converted from the following ttf files.