python python-2.7 utf-8 curses

Selected character maps within utf-8


I want to display the characters of the cp437 character map using their Unicode equivalents, encoded as utf-8.

I have all the code points for each of the cp437 characters.

The following code correctly displays a single cp437 character:

import locale
locale.setlocale(locale.LC_ALL, '')
icon = u'\u263A'.encode('utf-8')
print icon

Whereas the following code shows most of the cp437 characters, but not all:

for i in range(0x00,0x100):
    print chr(i).decode('cp437')

My guess is that the 2nd approach is not referencing the utf-8 encoding, but a separate incomplete cp437 character set.

I would like a way to produce any cp437 character as utf-8 without having to specify each of the 256 code points individually. So far I have resorted to typing the Unicode code point strings by hand into a massive 16x16 table. Is there a better way?

The following code demonstrates this:

from curses import *
import locale
locale.setlocale(locale.LC_ALL, '')

def main(stdscr):
    maxyx = stdscr.getmaxyx()
    text= str(maxyx)
    y_mid=maxyx[0]//2
    x_mid=maxyx[1]//2
    next_y,next_x = y_mid, x_mid
    curs_set(1)
    noecho()
    event=1
    y=0; x=0
    icon1=u'\u2302'.encode('utf-8')
    icon2=chr(0x7F).decode('cp437')

    while event !=ord('q'):
        stdscr.addstr(y_mid,x_mid-10,icon1)
        stdscr.addstr(y_mid,x_mid+10,icon2)
        event = stdscr.getch()

wrapper(main)    

The icon on the left comes from the utf-8 encoding and prints to the screen correctly. The icon on the right comes from decode('cp437') and does not print correctly: it appears as ^?.


Solution

  • As mentioned by @Martijn in the comments, the stock cp437 decoder converts characters 0-127 straight into their ASCII equivalents. For some applications this would be the right thing, as you wouldn't for example want '\n' to translate to u'\u25d9'. But for full fidelity to the code page, you need a custom decoder and encoder.
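
The behaviour described above can be checked with a short sketch. The byte values 0x01, 0x0A, 0x7F and 0xDB are picked here only as samples; the code is written to run on both Python 2.7 and Python 3. Byte 0x7F decodes to the DEL control character, which is the likely reason curses shows ^? in the question.

```python
# Build a byte string in a way that works on Python 2.7 and Python 3.
raw = bytes(bytearray([0x01, 0x0A, 0x7F, 0xDB]))
stock = raw.decode('cp437')

# Bytes below 0x80 come back as the same ASCII code points, control
# characters included; only 0xDB is mapped to a graphic character.
print([hex(ord(ch)) for ch in stock])
# prints ['0x1', '0xa', '0x7f', '0x2588']
```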

    The codecs module makes it easy to add your own codecs, but examples are hard to find. I used the sample at http://pymotw.com/2/codecs/ along with the Wikipedia table for Code page 437 to generate this module. It automatically registers a codec with the name 'cp437ex' when you import it.

    import codecs
    
    codec_name = 'cp437ex'
    
    _table = u'\0\u263a\u263b\u2665\u2666\u2663\u2660\u2022\u25d8\u25cb\u25d9\u2642\u2640\u266a\u266b\u263c\u25ba\u25c4\u2195\u203c\xb6\xa7\u25ac\u21a8\u2191\u2193\u2192\u2190\u221f\u2194\u25b2\u25bc !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u2302\xc7\xfc\xe9\xe2\xe4\xe0\xe5\xe7\xea\xeb\xe8\xef\xee\xec\xc4\xc5\xc9\xe6\xc6\xf4\xf6\xf2\xfb\xf9\xff\xd6\xdc\xa2\xa3\xa5\u20a7\u0192\xe1\xed\xf3\xfa\xf1\xd1\xaa\xba\xbf\u2310\xac\xbd\xbc\xa1\xab\xbb\u2591\u2592\u2593\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b\u2510\u2514\u2534\u252c\u251c\u2500\u253c\u255e\u255f\u255a\u2554\u2569\u2566\u2560\u2550\u256c\u2567\u2568\u2564\u2565\u2559\u2558\u2552\u2553\u256b\u256a\u2518\u250c\u2588\u2584\u258c\u2590\u2580\u03b1\xdf\u0393\u03c0\u03a3\u03c3\xb5\u03c4\u03a6\u0398\u03a9\u03b4\u221e\u03c6\u03b5\u2229\u2261\xb1\u2265\u2264\u2320\u2321\xf7\u2248\xb0\u2219\xb7\u221a\u207f\xb2\u25a0\xa0'
    
    decoding_map = { i: ord(ch) for i, ch in enumerate(_table) }
    
    encoding_map = codecs.make_encoding_map(decoding_map)
    
    class Codec(codecs.Codec):
        def encode(self, input, errors='strict'):
            return codecs.charmap_encode(input, errors, encoding_map)
    
        def decode(self, input, errors='strict'):
            return codecs.charmap_decode(input, errors, decoding_map)
    
    
    class IncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            return codecs.charmap_encode(input, self.errors, encoding_map)[0]
    
    class IncrementalDecoder(codecs.IncrementalDecoder):
        def decode(self, input, final=False):
            return codecs.charmap_decode(input, self.errors, decoding_map)[0]
    
    
    class StreamReader(Codec, codecs.StreamReader):
        pass
    
    class StreamWriter(Codec, codecs.StreamWriter):
        pass
    
    
    def _register(encoding):
        if encoding == codec_name:
            return codecs.CodecInfo(
                name=codec_name,
                encode=Codec().encode,
                decode=Codec().decode,
                incrementalencoder=IncrementalEncoder,
                incrementaldecoder=IncrementalDecoder,
                streamreader=StreamReader,
                streamwriter=StreamWriter)
    
    codecs.register(_register)
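
If typing all 256 entries is the main concern, one possible shortcut is to let the stock cp437 codec supply the printable ASCII range and the upper half, and override only the positions it leaves as control characters. This is a sketch under that assumption, not part of the module above: LOW_GLYPHS is copied from the first 31 escapes after \0 in _table, and decode_cp437_full is a hypothetical helper name.

```python
import codecs

# Glyphs that code page 437 assigns to bytes 0x01-0x1F, copied from the
# start of _table in the module above.
LOW_GLYPHS = (u'\u263a\u263b\u2665\u2666\u2663\u2660\u2022\u25d8'
              u'\u25cb\u25d9\u2642\u2640\u266a\u266b\u263c\u25ba'
              u'\u25c4\u2195\u203c\xb6\xa7\u25ac\u21a8\u2191'
              u'\u2193\u2192\u2190\u221f\u2194\u25b2\u25bc')

# Decode every byte value with the stock codec, then replace the slots
# it maps to ASCII control characters (0x01-0x1F and 0x7F).
stock = bytes(bytearray(range(256))).decode('cp437')
table = u'\0' + LOW_GLYPHS + stock[0x20:0x7F] + u'\u2302' + stock[0x80:]

decoding_map = dict((i, ord(ch)) for i, ch in enumerate(table))
encoding_map = codecs.make_encoding_map(decoding_map)


def decode_cp437_full(data):
    """Hypothetical helper: decode bytes so low values map to glyphs."""
    return codecs.charmap_decode(data, 'strict', decoding_map)[0]


print(len(table))  # one entry per byte value
```

The resulting decoding_map and encoding_map have the same shape as the ones in the module, so they could be passed to the same Codec classes.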
    

    Also note that decode produces Unicode strings, while encode produces byte strings. Printing a Unicode string should normally work, but your question suggests that your default encoding may not be set correctly. One of these should work:

    icon2='\x7f'.decode('cp437ex')
    icon2='\x7f'.decode('cp437ex').encode('utf-8')
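
To make the Unicode string versus byte string distinction concrete, here is a small sketch that uses only the stock cp437 codec, so it runs without the custom module. Byte 0xDB is an arbitrary sample from the upper half of the code page. The second form is the same kind of value as icon1 in the question, which is the one that displayed correctly.

```python
# Byte 0xDB is an arbitrary example value from the upper half of cp437.
raw = b'\xdb'

# decode: byte string -> Unicode string (here U+2588, FULL BLOCK)
text = raw.decode('cp437')

# encode: Unicode string -> utf-8 byte string for a utf-8 terminal
utf8_bytes = text.encode('utf-8')

print(repr(text), repr(utf8_bytes))
```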