
unicode table information about a character in python


Is there a way in Python to get the technical information for a given character as it is displayed in the Unicode table (cf. https://unicode-table.com/en/)?

Example: for the letter "Ȅ"

  • Name > Latin Capital Letter E with Double Grave
  • Unicode number > U+0204
  • HTML-code > &#516;
  • Block > Latin Extended-B
  • Lowercase > ȅ

What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").

Roughly:
input = a Unicode number
output = corresponding information

The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.

Thank you.


Solution

  • The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
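
    For the two things you actually asked for (name and lowercase), the standard module may already be enough. A minimal sketch, assuming Python 3 (where chr() accepts any code point); describe is just a name made up for this example:

```python
import unicodedata

def describe(codepoint):
    """Return (name, lowercase character) for an integer code point."""
    char = chr(codepoint)
    # name() raises ValueError for unnamed code points unless a default is given
    name = unicodedata.name(char, None)
    return name, char.lower()

name, lower = describe(0x0204)
print(name)                    # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print('U+%04X' % ord(lower))   # U+0205
```

    The block name is one of the properties unicodedata does not expose, which is why the rest of this answer reads the data files directly.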

    Fortunately UnicodeData.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 fields separated by semicolons. Using the description of the fields on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class attributes from that list; the meaning of each field is explained on that same page.
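
    To get a feel for the format, here is a small sketch that splits a single record into its 15 fields. The sample line is typed in by hand in the shape of the entry for U+0204, and FIELD_NAMES and split_record are names invented for this illustration (the field names match the attributes used in the classes of this answer):

```python
# Hand-written sample in the shape of the UnicodeData.txt entry for U+0204
SAMPLE = '0204;LATIN CAPITAL LETTER E WITH DOUBLE GRAVE;Lu;0;L;0045 030F;;;;N;;;;0205;'

FIELD_NAMES = [
    'code', 'name', 'category', 'combining', 'bidirectional',
    'decomposition', 'asDecimal', 'asDigit', 'asNumeric', 'mirrored',
    'uc1Name', 'comment', 'uppercase', 'lowercase', 'titlecase',
]

def split_record(line):
    """Split one semicolon-separated record into a dict of raw strings."""
    fields = line.strip().split(';')
    if len(fields) != len(FIELD_NAMES):
        raise ValueError('expected 15 fields, got %d' % len(fields))
    return dict(zip(FIELD_NAMES, fields))

record = split_record(SAMPLE)
print(record['name'])
print(int(record['lowercase'], 16) == 0x0205)   # True
```

    Empty fields stay empty strings here; converting them to int or None is what the parsing loop in the full code takes care of.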

    Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
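
    If you would rather fetch the files from a script, something along these lines should do. This is only a sketch for Python 3, and it assumes that the same two files are also served over HTTPS under https://www.unicode.org/Public/UNIDATA/; missing_files and fetch_missing are made-up helper names:

```python
import os
from urllib.request import urlretrieve   # Python 3 only

# Assumption: the same files are also served over HTTPS at this location.
BASE_URL = 'https://www.unicode.org/Public/UNIDATA/'
FILES = ('UnicodeData.txt', 'Blocks.txt')

def missing_files(folder='.'):
    """Return the data files that are not yet present in folder."""
    return [name for name in FILES
            if not os.path.exists(os.path.join(folder, name))]

def fetch_missing(folder='.'):
    """Download only the files that are missing (needs network access)."""
    for name in missing_files(folder):
        urlretrieve(BASE_URL + name, os.path.join(folder, name))

# fetch_missing()   # uncomment to actually download into the current folder
```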

    Code (tested with Python 2.7 and 3.6):

    # -*- coding: utf-8 -*-
    
    class UnicodeCharacter:
        def __init__(self):
            self.code = 0
            self.name = 'unnamed'
            self.category = ''
            self.combining = ''
            self.bidirectional = ''
            self.decomposition = ''
            self.asDecimal = None
            self.asDigit = None
            self.asNumeric = None
            self.mirrored = False
            self.uc1Name = None
            self.comment = ''
            self.uppercase = None
            self.lowercase = None
            self.titlecase = None
            self.block = None
    
        def __getitem__(self, item):
            return getattr(self, item)
    
        def __repr__(self):
            return '{'+self.name+'}'
    
    class UnicodeBlock:
        def __init__(self):
            self.first = 0
            self.last = 0
            self.name = 'unnamed'
    
        def __repr__(self):
            return '{'+self.name+'}'
    
    class BlockList:
        def __init__(self):
            self.blocklist = []
            with open('Blocks.txt','r') as uc_f:
                for line in uc_f:
                    line = line.strip(' \r\n')
                    if '#' in line:
                        line = line.split('#')[0].strip()
                    if line != '':
                        rawdata = line.split(';')
                        block = UnicodeBlock()
                        block.name = rawdata[1].strip()
                        rawdata = rawdata[0].split('..')
                        block.first = int(rawdata[0],16)
                        block.last = int(rawdata[1],16)
                        self.blocklist.append(block)
                # make 100% sure it's sorted, for quicker look-up later
                # (it is usually sorted in the file, but better make sure)
                self.blocklist.sort(key=lambda x: x.first)
    
        def lookup(self,code):
            for item in self.blocklist:
                if code >= item.first and code <= item.last:
                    return item.name
            return None
    
    class UnicodeList:
        """UnicodeList loads Unicode data from the external files
        'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org
    
        These files must appear in the same directory as this program.
    
        UnicodeList is a new interpretation of the standard library
        'unicodedata'; you may first want to check if its functionality
        suffices.
    
        As UnicodeList loads its data from an external file, it does not depend
        on the local build from Python (in which the Unicode data gets frozen
        to the then 'current' version).
    
        Initialize with
    
            uclist = UnicodeList()
        """
        def __init__(self):
    
            # we need this first
            blocklist = BlockList()
            bpos = 0
    
            self.codelist = []
            with open('UnicodeData.txt','r') as uc_f:
                for line in uc_f:
                    line = line.strip(' \r\n')
                    if '#' in line:
                        line = line.split('#')[0].strip()
                    if line != '':
                        rawdata = line.strip().split(';')
                        parsed = UnicodeCharacter()
                        parsed.code = int(rawdata[0],16)
                        parsed.name = rawdata[1]
                        parsed.category = rawdata[2]
                        parsed.combining = rawdata[3]
                        parsed.bidirectional = rawdata[4]
                        parsed.decomposition = rawdata[5]
                        parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                        parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                        # the following value may be a fraction:
                        #  ONE QUARTER ... 1/4
                        # divide by hand as floats: this works the same on
                        # Python 2.7 and 3, and avoids eval() on file data
                        if '/' in rawdata[8]:
                            num, den = rawdata[8].split('/')
                            parsed.asNumeric = float(num) / float(den)
                        else:
                            parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                        parsed.mirrored = rawdata[9] == 'Y'
                        parsed.uc1Name = rawdata[10]
                        parsed.comment = rawdata[11]
                        parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                        parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                        parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                        while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                            bpos += 1
                        parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                        self.codelist.append(parsed)
    
        def find_code(self,codepoint):
            """Find the Unicode information for a codepoint (as int).
    
            Returns:
                a UnicodeCharacter class object or None.
            """
            # the list is unlikely to contain duplicates but I have seen Unicode.org
            # doing that in similar situations. Again, better make sure.
            val = [x for x in self.codelist if codepoint == x.code]
            return val[0] if val else None
    
        def find_char(self,text):
            """Find the Unicode information for a codepoint (as character).
    
            Returns:
                for a single character: a UnicodeCharacter class object or
                None.
                for a multicharacter string: a list of the above, one element
                per character.
            """
            # named 'text' rather than 'str' so the built-in type is not shadowed
            if len(text) > 1:
                return [self.find_code(ord(x)) for x in text]
            else:
                return self.find_code(ord(text))
    

    When loaded, you can now look up a character code with

    >>> ul = UnicodeList()     # ONLY NEEDED ONCE!
    >>> print (ul.find_code(0x204))
    {LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
    

    which by default prints the name of the character (Unicode calls the character's number a 'code point'), but you can retrieve other properties as well:

    >>> print ('%04X' % ul.find_code(0x204).lowercase)
    0205
    >>> print (ul.find_code(0x204).block)
    Latin Extended-B
    

    and (as long as you don't get a None) even chain them:

    >>> print (ul.find_code(ul.find_code(0x204).lowercase))
    {LATIN SMALL LETTER E WITH DOUBLE GRAVE}
    

    It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:

    >>> import unicodedata
    >>> print (unicodedata.name('\U0001F903'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: no such name
    >>> print (ul.find_code(0x1f903))
    {LEFT HALF CIRCLE WITH FOUR DOTS}
    

    (As tested with Python 3.5.3.)
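
    To see which Unicode version your own build is frozen to, the standard module has a unidata_version attribute. A small sketch for Python 3; known_to_stdlib is a made-up helper, and note that it also reports False for code points that simply have no name, such as control characters:

```python
import unicodedata

# Unicode version baked into this Python build (the value varies per build)
print(unicodedata.unidata_version)

def known_to_stdlib(codepoint):
    """True if this build's unicodedata has a name for the code point."""
    return unicodedata.name(chr(codepoint), None) is not None

print(known_to_stdlib(0x0204))    # an old character, present in any build
print(known_to_stdlib(0x1F903))   # depends on the build's Unicode version
```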

    There are currently two lookup functions defined:

    • find_code(int) looks up character information by codepoint as an integer.
    • find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
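
    Note that find_code scans the whole list on every call. If you do many lookups, one option is to build a dictionary once and look codes up in that. A rough sketch; Char is a hypothetical stand-in for the UnicodeCharacter objects in ul.codelist, so that the example runs without the data files:

```python
from collections import namedtuple

# Hypothetical stand-in for the UnicodeCharacter objects held in ul.codelist
Char = namedtuple('Char', 'code name lowercase')

codelist = [
    Char(0x0204, 'LATIN CAPITAL LETTER E WITH DOUBLE GRAVE', 0x0205),
    Char(0x0205, 'LATIN SMALL LETTER E WITH DOUBLE GRAVE', None),
]

# Build the index once; the first entry wins if a code point ever repeats,
# which matches the val[0] behaviour of the list-based find_code
index = {}
for entry in codelist:
    index.setdefault(entry.code, entry)

def find_code(codepoint):
    """Return the entry for an integer code point, or None."""
    return index.get(codepoint)

print(find_code(0x0204).name)
print(find_code(0x9999))   # None
```

    With the real class you would fill the dictionary from self.codelist at the end of __init__.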

    After from unicodelist import UnicodeList (assuming you saved this as unicodelist.py), you can use

    >>> ul = UnicodeList()
    >>> hex(ul.find_char(u'è').code)
    '0xe8'
    

    to look up the hex code for any character, and a list comprehension such as

    >>> l = [hex(ul.find_char(x).code) for x in 'Hello']
    >>> l
    ['0x48', '0x65', '0x6c', '0x6c', '0x6f']
    

    for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:

     l = [hex(ord(x)) for x in 'Hello']
    

    The purpose of this module is to give easy access to other Unicode properties. A longer example:

    text = 'Héllo...'
    dest = ''
    for ch in text:
        info = ul.find_char(ch)
        # keep the original character when there is no uppercase mapping
        dest += chr(info.uppercase) if info is not None and info.uppercase is not None else ch
    print (dest)
    
    HÉLLO...
    

    and showing a list of properties for a character per your example:

    letter = u'Ȅ'
    print ('Name > '+ul.find_char(letter).name)
    print ('Unicode number > U+%04X' % ul.find_char(letter).code)
    print ('Block > '+ul.find_char(letter).block)
    print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
    

    (I left out HTML; these names are not defined in the Unicode standard.)
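
    If you do want the HTML code, the numeric form can be built from the code point alone, and Python 3 ships a table of the HTML 4 entity names in html.entities. A sketch under those assumptions; html_numeric and html_named are made-up helper names:

```python
from html.entities import codepoint2name   # Python 3 standard library

def html_numeric(codepoint):
    """Decimal numeric character reference, valid for any code point."""
    return '&#%d;' % codepoint

def html_named(codepoint):
    """Named HTML 4 entity, or None if the code point has no such name."""
    name = codepoint2name.get(codepoint)
    return '&%s;' % name if name else None

print(html_numeric(0x0204))   # &#516;
print(html_named(0x00E8))     # &egrave;
print(html_named(0x0204))     # None
```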