Search code examples
pythonunicodestring-formatting

How to control padding of Unicode string containing east Asia characters


I got three UTF-8 stings:

hello, world
hello, 世界
hello, 世rld

I only want the first 10 ascii-char-width so that the bracket in one column:

[hello, wor]
[hello, 世 ]
[hello, 世r]

In console:

width('世界')==width('worl')
width('世 ')==width('wor')  #a white space behind '世'

One chinese char is three bytes, but it only 2 ascii chars width when displayed in console:

>>> bytes("hello, 世界", encoding='utf-8')
b'hello, \xe4\xb8\x96\xe7\x95\x8c'

python's format() doesn't help when UTF-8 chars mixed in

>>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
...    print(s)
...
[hello, wor]
[hello, 世界 ]
[hello, 世rl]

It's not pretty:

 -----------Songs-----------
|    1: 蝴蝶                  |
|    2: 心之城                 |
|    3: 支持你的爱人              |
|    4: 根生的种子               |
|    5: 鸽子歌(CUCURRUCUCU PALO|
|    6: 林地之间                |
|    7: 蓝光                  |
|    8: 在你眼里                |
|    9: 肖邦离别曲               |
|   10: 西行( 魔戒王者再临主题曲)(INTO |
| X 11: 深陷爱河                |
| X 12: 钟爱大地(THE MO RUN AIR |
| X 13: 时光流逝                |
| X 14: 卡农                  |
| X 15: 舒伯特小夜曲(SERENADE)    |
| X 16: 甜蜜的摇篮曲(Sweet Lullaby|
 ---------------------------

So, I wonder if there is a standard way to do the UTF-8 padding staff?


Solution

  • When trying to line up ASCII text with Chinese in fixed-width font, there is a set of full width versions of the printable ASCII characters. Below I made a translation table of ASCII to full width version:

    # coding: utf8
    
    # full width versions (SPACE is non-contiguous with ! through ~)
    SPACE = '\N{IDEOGRAPHIC SPACE}'
    EXCLA = '\N{FULLWIDTH EXCLAMATION MARK}'
    TILDE = '\N{FULLWIDTH TILDE}'
    
    # strings of ASCII and full-width characters (same order)
    west = ''.join(chr(i) for i in range(ord(' '),ord('~')))
    east = SPACE + ''.join(chr(i) for i in range(ord(EXCLA),ord(TILDE)))
    
    # build the translation table
    full = str.maketrans(west,east)
    
    data = '''\
    蝴蝶(A song)
    心之城(Another song)
    支持你的爱人(Yet another song)
    根生的种子
    鸽子歌(Cucurrucucu palo whatever)
    林地之间
    蓝光
    在你眼里
    肖邦离别曲
    西行(魔戒王者再临主题曲)(Into something)
    深陷爱河
    钟爱大地
    时光流逝
    卡农
    舒伯特小夜曲(SERENADE)
    甜蜜的摇篮曲(Sweet Lullaby)
    '''
    
    # Replace the ASCII characters with full width, and create a song list.
    data = data.translate(full).rstrip().split('\n')
    
    # translate each printable line.
    print(' ----------Songs-----------'.translate(full))
    for i,song in enumerate(data):
        line = '|{:4}: {:20.20}|'.format(i+1,song)
        print(line.translate(full))
    print(' --------------------------'.translate(full))
    

    Output

     ----------Songs-----------
    |   1: 蝴蝶(A song)          |
    |   2: 心之城(Another song)   |
    |   3: 支持你的爱人(Yet another s|
    |   4: 根生的种子               |
    |   5: 鸽子歌(Cucurrucucu palo|
    |   6: 林地之间                |
    |   7: 蓝光                  |
    |   8: 在你眼里                |
    |   9: 肖邦离别曲               |
    |  10: 西行(魔戒王者再临主题曲)(Into s|
    |  11: 深陷爱河                |
    |  12: 钟爱大地                |
    |  13: 时光流逝                |
    |  14: 卡农                  |
    |  15: 舒伯特小夜曲(SERENADE)    |
    |  16: 甜蜜的摇篮曲(Sweet Lullaby|
     --------------------------
    

    It's not overly pretty, but it lines up.