
Initialize object for unicode fonts


I wrote a class to access the mathematical alphanumeric symbols from the Unicode block described at https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols

# Sans-serif
LATIN_SANSERIF_NORMAL_UPPER = (120224, 120250)
LATIN_SANSERIF_NORMAL_LOWER = (120250, 120276)
LATIN_SANSERIF_BOLD_UPPER = (120276, 120302)
LATIN_SANSERIF_BOLD_LOWER = (120302, 120328)
LATIN_SANSERIF_ITALIC_UPPER = (120328, 120354)
LATIN_SANSERIF_ITALIC_LOWER = (120354, 120380)
LATIN_SANSERIF_BOLDITALIC_UPPER = (120380, 120406)
LATIN_SANSERIF_BOLDITALIC_LOWER = (120406, 120432)

class MathAlphanumeric:
    def __init__(self, script, font, style, case):
        self.script = script
        self.font = font
        self.style = style
        self.case = case
        
    def charset(self):
        start, end = eval('_'.join([self.script, self.font, self.style, self.case]).upper())
        for c in range(start, end):
            yield chr(c)
    
    @staticmethod
    def supported_scripts():
        return {'latin', 'greek', 'digits'}
    
    @staticmethod
    def supported_fonts():
        return {'serif', 'sanserif', 'calligraphy', 'fraktor', 'monospace', 'doublestruck'}
    
    @staticmethod
    def supported_style():
        return {'normal', 'bold', 'italic', 'bold-italic'}
    
    @staticmethod
    def supported_case():
        return {'upper', 'lower'}
         

And to use it, I'll do:

ma = MathAlphanumeric('latin', 'sanserif', 'bold', 'lower')
print(list(ma.charset()))

[out]:

['𝗮', '𝗯', '𝗰', '𝗱', '𝗲', '𝗳', '𝗴', '𝗵', '𝗶', '𝗷', '𝗸', '𝗹', '𝗺', '𝗻', '𝗼', '𝗽', '𝗾', '𝗿', '𝘀', '𝘁', '𝘂', '𝘃', '𝘄', '𝘅', '𝘆', '𝘇']

The code works as expected, but to cover all the mathematical alphanumeric symbols I'll have to write out the start and end code points for every combination, i.e. script * fonts * style * case constants.

My questions are:

  • Is there a better way to create the desired MathAlphanumeric object?
  • Is there a way to avoid the initialisation of script * fonts * style * case no. of constants, in order for MathAlphanumeric.charset() to work as expected?
  • Is an object or function like this already available in some unicode.org-related library?

Solution

  • You may be interested in unicodedata, a standard library module, specifically:

    • unicodedata.lookup :

      Look up character by name. If a character with the given name is found, return the corresponding character. If not found, KeyError is raised.

    • unicodedata.name :

      Returns the name assigned to the character chr as a string.

    A quick example :

    >>> import unicodedata
    >>> unicodedata.name(chr(0x1d5a0))
    'MATHEMATICAL SANS-SERIF CAPITAL A'
    >>> unicodedata.lookup("MATHEMATICAL SANS-SERIF CAPITAL A")
    '𝖠'
    >>> unicodedata.name(chr(0x1d504))
    'MATHEMATICAL FRAKTUR CAPITAL A'
    >>> unicodedata.lookup("MATHEMATICAL FRAKTUR CAPITAL A")
    '𝔄'
    

    Now you have to find all the names that unicodedata expects for your use cases, construct the corresponding string from them, and call lookup.
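
One way to find those names is to ask unicodedata for every code point in the block. The sketch below assumes the block spans U+1D400 to U+1D7FF (the range given on the Wikipedia page). name_prefixes is a hypothetical helper written for this illustration, not part of any library; it collects the part of each character name that precedes CAPITAL or SMALL.

```python
import unicodedata


def name_prefixes(start=0x1D400, stop=0x1D800):
    """Collect the name prefixes used in the block (assumed range, see above)."""
    prefixes = set()
    for code_point in range(start, stop):
        # Unassigned code points have no name; use "" instead of raising.
        name = unicodedata.name(chr(code_point), "")
        for marker in (" CAPITAL ", " SMALL "):
            if marker in name:
                # Keep only what comes before the case word.
                prefixes.add(name.split(marker)[0])
                break
    return prefixes


if __name__ == "__main__":
    for prefix in sorted(name_prefixes()):
        print(prefix)
```

The printed prefixes show which word order unicodedata.lookup expects for each font and style combination.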

    Here is a mini proof-of-concept :

    import unicodedata
    import string
    
    
    def charset(script: str, font: str, style: str, case: str):
        features = ["MATHEMATICAL"]
        # TODO: use script
        assert font in MathAlphanumeric.supported_fonts(), f"invalid font {font!r}"
        features.append(font.upper())
        assert style in MathAlphanumeric.supported_style(), f"invalid style {style!r}"
        if style != "normal":
            # Unicode names spell the combined style "BOLD ITALIC", with a space.
            style_words = style.upper().replace("-", " ")
            if font == "fraktur":
                features.insert(-1, style_words)  # "BOLD" must come before "FRAKTUR"
            elif font in ("monospace", "double-struck"):
                pass  # these fonts have a single style, which is implicit
            else:
                features.append(style_words)
        assert case in MathAlphanumeric.supported_case(), f"invalid case {case!r}"
        features.append("CAPITAL" if case == "upper" else "SMALL")
        # Unicode names always end with the uppercase letter, e.g. "... SMALL A".
        prefix = " ".join(features)
        return tuple(unicodedata.lookup(f"{prefix} {letter}") for letter in string.ascii_uppercase)
    
    
    if __name__ == '__main__':
        print("".join(charset("latin", "sans-serif", "bold", "lower")))
        # 𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇
        print("".join(charset("latin", "fraktur", "bold", "upper")))
        # 𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅
        print("".join(charset("latin", "monospace", "bold", "upper")))
        # 𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉
        print("".join(charset("latin", "double-struck", "bold", "upper")))
        # KeyError: "undefined character name 'MATHEMATICAL DOUBLE-STRUCK CAPITAL C'"
    

    (I also slightly changed your supported_fonts method: return {'serif', 'sans-serif', 'calligraphy', 'fraktur', 'monospace', 'double-struck'})

    But there are a lot of caveats in Unicode: it holds all the glyphs you could possibly want, but they are not organized in a coherent way (for historical reasons). The failure in my example comes from a gap of this kind, shown here for Fraktur:

    >>> unicodedata.name("𝔅")  # the letter copied from the Wikipedia page
    'MATHEMATICAL FRAKTUR CAPITAL B'
    >>> unicodedata.name("ℭ")  # same, but for C
    'BLACK-LETTER CAPITAL C'
    

    So you will need a lot of special cases.
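
A minimal sketch of one such special case, assuming the letters missing from the mathematical block are found under the prefixes BLACK-LETTER and DOUBLE-STRUCK, as the names above suggest. FALLBACK_PREFIX and lookup_letter are hypothetical names introduced for illustration; other fonts (script, for example) would need their own entries.

```python
import string
import unicodedata

# Hypothetical mapping: name prefix to try when the MATHEMATICAL name is missing.
FALLBACK_PREFIX = {
    "FRAKTUR": "BLACK-LETTER",
    "DOUBLE-STRUCK": "DOUBLE-STRUCK",
}


def lookup_letter(font_words, case_word, letter):
    """Look up a styled letter, falling back to the older letterlike name."""
    try:
        return unicodedata.lookup(f"MATHEMATICAL {font_words} {case_word} {letter}")
    except KeyError:
        fallback = FALLBACK_PREFIX.get(font_words)
        if fallback is None:
            raise  # no known special case for this font
        return unicodedata.lookup(f"{fallback} {case_word} {letter}")


if __name__ == "__main__":
    print("".join(lookup_letter("DOUBLE-STRUCK", "CAPITAL", c)
                  for c in string.ascii_uppercase))
```

Trying the regular name first keeps the special cases confined to the letters that actually need them.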

    Also :

    • using eval is considered bad practice (cf. this question); avoid it if you can.
    • using decimal values for Unicode code points is not convenient: I had to convert to and from hexadecimal to compare your code with the Wikipedia page. Prefixing a literal with 0x is enough to tell Python it is a hexadecimal value; it may look strange, but it works exactly the same: 0x1d5a0 == 120224 is True.
    • a class with a single method that takes all its parameters from __init__ is considered a code smell; you can just make it a function, which is simpler and cleaner. If what you want is a namespace, you could use a Python module instead.
    • the supported scripts, fonts, styles and cases are constant, you could make them class variables instead of putting them in staticmethods.
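
The points above could be combined roughly as follows. This is only a sketch covering the sans-serif Latin ranges from the question: SANS_SERIF_LATIN_START and sans_serif_latin are names made up here, and the hexadecimal values are the question's decimal constants rewritten. A dictionary lookup replaces eval, and a plain function replaces the class.

```python
ALPHABET_SIZE = 26

# Start code points, keyed by (style, case); hexadecimal matches the code charts.
SANS_SERIF_LATIN_START = {
    ("normal", "upper"): 0x1D5A0,
    ("normal", "lower"): 0x1D5BA,
    ("bold", "upper"): 0x1D5D4,
    ("bold", "lower"): 0x1D5EE,
    ("italic", "upper"): 0x1D608,
    ("italic", "lower"): 0x1D622,
    ("bold-italic", "upper"): 0x1D63C,
    ("bold-italic", "lower"): 0x1D656,
}


def sans_serif_latin(style, case):
    """Return the 26 sans-serif Latin letters for the given style and case."""
    try:
        start = SANS_SERIF_LATIN_START[(style, case)]
    except KeyError:
        raise ValueError(f"unsupported combination: {style!r}, {case!r}") from None
    return [chr(code_point) for code_point in range(start, start + ALPHABET_SIZE)]


if __name__ == "__main__":
    print("".join(sans_serif_latin("bold", "lower")))
```

An unknown combination raises ValueError instead of evaluating an arbitrary string, which is the main reason to prefer the dictionary over eval.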