Tags: python, python-2.7, unicode, string-formatting, non-ascii-characters

Formatting columns containing non-ascii characters


So I want to align fields containing non-ascii characters. The following does not seem to work:

for word1, word2 in [['hello', 'world'], ['こんにちは', '世界']]:
    print "{:<20} {:<20}".format(word1, word2)

hello                world
こんにちは      世界

Is there a solution?


Solution

  • You are formatting a multi-byte encoded byte string. You appear to be using UTF-8 to encode your text, and that encoding uses between 1 and 4 bytes per codepoint, depending on the character. In Python 2, formatting a plain str counts bytes, not codepoints, which is one reason why your strings end up misaligned:

    >>> len('hello')
    5
    >>> len('こんにちは')
    15
    >>> len(u'こんにちは')
    5
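
To see why the padding differs, here is a minimal sketch comparing the two string types. It uses ljust(), which derives its padding from len() just as the {:<20} format spec does; the variable names are illustrative and not taken from the question.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.

text = u'こんにちは'          # Unicode string: 5 codepoints
raw = text.encode('utf-8')    # UTF-8 bytes: 3 bytes per codepoint here, 15 in total

# Padding is derived from len(), so the same target width of 20
# adds a different number of spaces depending on the string type.
text_padding = len(text.ljust(20)) - len(text)
raw_padding = len(raw.ljust(20)) - len(raw)

print(u'codepoints: {0}, padding: {1}'.format(len(text), text_padding))
print(u'bytes:      {0}, padding: {1}'.format(len(raw), raw_padding))
```

The byte string receives only 5 spaces of padding while the Unicode string receives 15, which accounts for the first part of the misalignment.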
    

    Format your text as Unicode strings instead, so that you can count codepoints, not bytes:

    for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
        print u"{:<20} {:<20}".format(word1, word2)
    

    Your next problem is that these characters are also wider than most; you have double-wide codepoints:

    >>> import unicodedata
    >>> unicodedata.east_asian_width(u'h')
    'Na'
    >>> unicodedata.east_asian_width(u'世')
    'W'
    >>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    ...     print u"{:<20} {:<20}".format(word1, word2)
    ...
    hello                world
    こんにちは                世界
    

    str.format() is not equipped to deal with that issue; you'll have to manually adjust your column widths before formatting based on how many characters are registered as wider in the Unicode standard.

    This is tricky because there is more than one width available. See the East Asian Width Unicode Standard Annex (UAX #11); there are narrow, wide and ambiguous widths. Narrow is the width most other characters print at, and wide is double that on my terminal. Ambiguous is... ambiguous as to how wide it will actually be displayed:

    Ambiguous characters require additional information not contained in the character code to further resolve their width.

    How they are displayed depends on the context; Greek characters, for example, are displayed as narrow characters in Western text, but as wide in an East Asian context. My terminal displays them as narrow, but other terminals (configured for an East Asian locale, for example) may display them as wide instead. I'm not sure whether there is any foolproof way to figure out how a given terminal will display them.
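
One way to cope with that uncertainty is to make the treatment of ambiguous characters an explicit choice. The sketch below is only an illustration: char_columns and its ambiguous_wide flag are hypothetical names, and the caller still has to know (or guess) how the target terminal behaves.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.
import unicodedata


def char_columns(ch, ambiguous_wide=False):
    """Estimate how many terminal columns a single codepoint occupies.

    ambiguous_wide is a hypothetical switch: set it when the target
    terminal is assumed to render 'A' (ambiguous) codepoints as wide.
    """
    category = unicodedata.east_asian_width(ch)
    if category in ('W', 'F'):
        return 2
    if category == 'A' and ambiguous_wide:
        return 2
    return 1


for ch in (u'h', u'世', u'α'):
    print(u'{0}: category {1}, {2} column(s), {3} if ambiguous counts as wide'.format(
        unicodedata.name(ch), unicodedata.east_asian_width(ch),
        char_columns(ch), char_columns(ch, ambiguous_wide=True)))
```

Only the Greek letter changes between the two modes; the Latin letter stays at one column and the CJK ideograph stays at two.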

    For the most part, you need to count characters with a 'W' or 'F' value for unicodedata.east_asian_width() as taking 2 positions; subtract 1 from your format width for each of these:

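    import unicodedata

    def calc_width(target, text):
        # Each wide ('W') or fullwidth ('F') codepoint takes 2 columns, so shrink the width by 1 for each
        return target - sum(unicodedata.east_asian_width(c) in ('W', 'F') for c in text)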
    
    for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
        print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20, word2))
    

    This then produces the desired alignment in my terminal:

    >>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    ...     print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20, word2))
    ...
    hello                world
    こんにちは           世界
    

    The slight misalignment you may see above is your browser or font using a different width ratio (not quite double) for the wide codepoints.
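
The subtraction in calc_width can also be turned around: measure the display width of the text and pad by hand. The following is a rough sketch of that idea, not a complete solution. display_width and ljust_display are hypothetical helper names, and the sketch assumes every codepoint that is not 'W' or 'F' takes exactly one column, so combining marks and other zero-width codepoints are not handled.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.
import unicodedata


def display_width(text):
    """Estimate terminal columns: 2 per wide/fullwidth codepoint, 1 otherwise."""
    return sum(2 if unicodedata.east_asian_width(c) in ('W', 'F') else 1
               for c in text)


def ljust_display(text, width, fill=u' '):
    """Left-justify text to `width` terminal columns rather than codepoints."""
    return text + fill * max(0, width - display_width(text))


rows = [[u'hello', u'world'], [u'こんにちは', u'世界']]
for word1, word2 in rows:
    print(ljust_display(word1, 20) + u' ' + ljust_display(word2, 20))
```

Padding by hand keeps the width logic in one place, which may be convenient if the same columns are also built with something other than str.format().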

    All this comes with a caveat: some terminals do not support the East Asian Width Unicode property and display all codepoints at a single width.