python-2.7 utf-8 pyqt4 decode

Decoding string values as UTF-8


I want to decode string values as UTF-8, but the output doesn't change. Here is my code:

self.textEdit_3.append(str(self.new_header).decode("utf-8") + "\n")

[Screenshot of the result; image not available]

The original output value is:

['matchkey', 'a', 'b', 'd', '안녕'] # 안녕 is Korean

I changed the default encoding used for converting between str and unicode from ascii to utf-8 by adding this code at the top of the file:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Why doesn't the string value change?



Solution

  • You can fix your code like this:

    header = str(self.new_header).decode('string-escape').decode("utf-8")
    self.textEdit_3.append(header + "\n")
    

    You do not need the setdefaultencoding lines.


    Explanation:

    The original value is a list containing byte-strings:

    >>> value = ['matchkey', 'a', 'b', 'd', '안녕']
    >>> value
    ['matchkey', 'a', 'b', 'd', '\xec\x95\x88\xeb\x85\x95']
    

    If you convert this list with str, it will use repr on all the list elements, which turns the non-ASCII bytes into literal backslash escapes:

    >>> strvalue = str(value)
    >>> strvalue
    "['matchkey', 'a', 'b', 'd', '\\xec\\x95\\x88\\xeb\\x85\\x95']"
    

    The escape sequences produced by repr can be turned back into raw bytes like this:

    >>> strvalue = strvalue.decode('string-escape')
    >>> strvalue
    "['matchkey', 'a', 'b', 'd', '\xec\x95\x88\xeb\x85\x95']"
    

    and this can now be decoded to unicode like this:

    >>> univalue = strvalue.decode('utf-8')
    >>> univalue
    u"['matchkey', 'a', 'b', 'd', '\uc548\ub155']"
    >>> print univalue
    ['matchkey', 'a', 'b', 'd', '안녕']
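
As an alternative to round-tripping through str and string-escape, each list element could be decoded on its own and the display text built explicitly. This is only a minimal sketch: `new_header` is a hypothetical stand-in for `self.new_header`, assumed to be a list of UTF-8 byte strings, and the `b''` / `u''` prefixes are there so the snippet behaves the same on Python 2.7 and Python 3.

```python
# Hypothetical stand-in for self.new_header: a list of UTF-8 byte strings.
# The last item is the UTF-8 encoding of the Korean word from the question.
new_header = [b'matchkey', b'a', b'b', b'd', b'\xec\x95\x88\xeb\x85\x95']

# Decode each element separately, so repr() never escapes the raw bytes.
decoded = [item.decode('utf-8') for item in new_header]

# Build the display text explicitly instead of relying on str(list).
header = u', '.join(decoded)
```

The resulting `header` is a unicode string, so it could presumably be passed to `self.textEdit_3.append(header + u"\n")` directly. Unlike the str-based version, the output has no brackets or quotes, which may or may not be what you want.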
    

    PS:

    Regarding the problems reading files with a UTF-8 BOM, please test this script:

    # -*- coding: utf-8 -*-
    
    import os, codecs, tempfile
    
    text = u'a,b,d,안녕'
    data = text.encode('utf-8-sig')
    
    print 'text:', repr(text), len(text)
    print 'data:', repr(data), len(data)
    
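    # mkstemp returns an OS-level file descriptor and the path of the new file
    fd, path = tempfile.mkstemp()
    print 'write:', os.write(fd, data)
    os.close(fd)
    
    # the 'utf-8-sig' codec strips the BOM while reading
    with codecs.open(path, 'r', encoding='utf-8-sig') as f:
        string = f.read()
        print 'read:', repr(string), len(string), string == text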
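
The difference between the two codecs can also be seen without touching the file system. The following is a small sketch of that idea, not part of the script above; the variable names `has_bom`, `with_bom` and `without_bom` are made up for illustration, and the Korean text is written with `\u` escapes so the snippet runs unchanged on Python 2.7 and Python 3.

```python
import codecs

text = u'a,b,d,\uc548\ub155'
data = text.encode('utf-8-sig')   # BOM bytes followed by the UTF-8 encoded text

# The encoded data starts with the UTF-8 byte order mark.
has_bom = data.startswith(codecs.BOM_UTF8)

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character ...
with_bom = data.decode('utf-8')

# ... while 'utf-8-sig' removes it during decoding.
without_bom = data.decode('utf-8-sig')
```

If a file read with plain 'utf-8' yields a first field that does not compare equal to the expected header, a leading U+FEFF like the one in `with_bom` is a likely cause.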