My ultimate aim is to create a mapping from glyph_id to unicode_chars. The mapping will be roughly of the form glyph_id --> uni_1, uni_2, uni_3 ..., since a single glyph can be mapped to many ordered unicode_characters.
I am looking for a tool or library, preferably in Python, through which I can access the meta-information inside a font, such as its tables.
I am also looking for a solid source that explains how multiple Unicode code points are mapped to glyphs.
I know that tools like harfbuzz generate (glyph, position) pairs for a given Unicode string, but I am not sure whether it can do the reverse.
Any kind of help will be appreciated. Thanks.
You should probably check out the fontTools Python library, which has the components you need for working with fonts.
The font table you're interested in is the 'cmap' table, and what you want is basically a reverse mapping of a Unicode mapping subtable (there are several kinds of subtables which can map Unicode code points; if you're unfamiliar with this concept, I recommend checking out the OpenType specification for more information). Basically you get the Unicode-to-glyph mapping, and reverse that.
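To see which subtables a particular font actually carries, something like the following sketch may help. The function name describe_cmap_subtables is made up for illustration; it only relies on the tables list of the 'cmap' table and on the platformID, platEncID, format, cmap and isUnicode() members that fontTools exposes on each subtable.

```python
def describe_cmap_subtables(font):
    """Return one (platformID, platEncID, format, is_unicode, size) row
    per cmap subtable of an opened font (hypothetical helper)."""
    rows = []
    for sub in font["cmap"].tables:
        rows.append((
            sub.platformID,    # 0 = Unicode, 1 = Macintosh, 3 = Windows
            sub.platEncID,     # encoding ID within that platform
            sub.format,        # subtable format, e.g. 4 or 12
            sub.isUnicode(),   # True if the keys are Unicode code points
            len(sub.cmap),     # number of code point -> glyph name entries
        ))
    return rows
```

With a real font you would pass in TTFont('path/to/fontfile.ttf') and print the rows; the subtables for which is_unicode is True are the candidates for reversing.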
fontTools actually has a nice feature that will automatically select the "best" cmap subtable (it has an ordered list of preferred cmap subtable kinds, and returns the first available in the particular font you have opened). Here's an example using that function:
from collections import defaultdict

from fontTools.ttLib import TTFont

font = TTFont('path/to/fontfile.ttf')
# Unicode code point -> glyph name, taken from the preferred cmap subtable
unicode_map = font.getBestCmap()
# Invert it: glyph name -> list of code points
reverse_unicode_map = defaultdict(list)
for k, v in unicode_map.items():
    reverse_unicode_map[v].append(k)
reverse_unicode_map now holds a mapping of glyph (glyph name) to a list of integer codepoints:
>>> reverse_unicode_map
defaultdict(<class 'list'>, {'.null': [0, 8, 29], 'nonmarkingreturn': [9, 13], 'space': [32], 'exclam': [33], 'quotedbl': [34], 'numbersign': [35], 'dollar': [36], 'percent': [37], 'quotesingle': [39], 'parenleft': [40], 'parenright': [41], 'asterisk': [42], 'plus': [43], 'comma': [44], 'hyphen': [45], 'period': [46], 'slash': [47], 'zero': [48], 'one': [49], 'two': [50], 'three': [51], 'four': [52], 'five': [53]})
You can see that there are 2 glyphs, ".null" and "nonmarkingreturn", that map to more than one Unicode code point.
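If the font is large, picking those glyphs out by eye gets tedious. A small sketch that filters the reverse map down to glyphs with more than one code point and prints them in the usual U+XXXX notation could look like this (multi_mapped is a hypothetical helper name, and the sample dictionary is just a slice of the output above):

```python
def multi_mapped(reverse_map):
    """Keep only glyphs reachable from more than one code point,
    formatting each code point as U+XXXX."""
    return {
        name: ["U+%04X" % cp for cp in sorted(cps)]
        for name, cps in reverse_map.items()
        if len(cps) > 1
    }


# A few entries copied from the output shown above
sample = {'.null': [0, 8, 29], 'nonmarkingreturn': [9, 13], 'space': [32]}
print(multi_mapped(sample))
```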
If you need to resolve the glyph names to glyph indices, you can use the font.getGlyphID() method (pass in the glyph name; it will return the corresponding integer ID).
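Putting the two pieces together, here is a rough sketch of going straight from glyph ID to a list of characters. The names glyph_id_to_chars and load_font are made up for this example; the only fontTools calls it assumes are TTFont(), getBestCmap() and getGlyphID(). Note that this covers only the one-code-point-per-lookup mapping stored in 'cmap'; it does not look at substitutions that turn several characters into one glyph.

```python
from collections import defaultdict


def glyph_id_to_chars(font):
    """Map glyph ID -> sorted list of characters, for any object that
    offers getBestCmap() and getGlyphID() the way fontTools' TTFont does."""
    # getBestCmap() returns None when the font has no Unicode subtable
    cmap = font.getBestCmap() or {}
    mapping = defaultdict(list)
    for codepoint, glyph_name in cmap.items():
        mapping[font.getGlyphID(glyph_name)].append(chr(codepoint))
    # Plain dict, ordered by glyph ID, with characters in code point order
    return {gid: sorted(chars) for gid, chars in sorted(mapping.items())}


def load_font(path):
    # Imported here so the helper above stays usable without fontTools
    from fontTools.ttLib import TTFont
    return TTFont(path)
```

Typical use would be glyph_id_to_chars(load_font('path/to/fontfile.ttf')), where the path is a placeholder for your own font file.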