Tags: python, python-3.x, unicode, utf-8, mojibake

Python Unicode: string written to a file appears as different characters


I am using Python 3.4 to write a Unicode string to a file. After the file is written, if I open it and read it back, I see a totally different set of characters.

CODE:-

# -*- coding: utf-8 -*-

with open('test.txt', 'w', encoding='utf-8') as f:
    name = 'أبيض'
    name.encode("utf-8")
    f.write(name)
    f.close()    

f = open('test.txt','r')
for line in f.readlines():
    print(line) 

OUTPUT:-

Ø£Ø¨ÙŠØ¶

Thanks in advance


Solution

  • You need to specify the codec to use when reading as well:

    f = open('test.txt','r', encoding='utf8')
    for line in f.readlines():
        print(line) 
    

    Otherwise, your system default encoding is used; see the open() function documentation:

    encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.
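    To see which default applies on a given machine, something like the following sketch may help. It assumes a standard CPython 3 install; the temporary file name probe.txt is made up for illustration.

```python
import locale
import os
import tempfile

# Encoding Python falls back to for text files when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print("locale default:", default_encoding)

# A text-mode file object reports the encoding it is actually using.
# "probe.txt" is a hypothetical scratch file in a fresh temp directory.
path = os.path.join(tempfile.mkdtemp(), "probe.txt")
with open(path, "w", encoding="utf-8") as f:
    explicit_encoding = f.encoding
print("explicit:", explicit_encoding)
```

    If the first value is something like cp1252 rather than UTF-8, reading a UTF-8 file without an explicit encoding will likely garble any non-ASCII text.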

    Judging by the output you got, your system is using Windows Codepage 1252 as the default:

    >>> 'أبيض'.encode('utf8').decode('cp1252')
    'Ø£Ø¨ÙŠØ¶'
    

    By using the wrong codec when reading, you created what is called mojibake.
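    As a rough illustration, the garbling can be reproduced and undone entirely in memory. This sketch assumes the wrong codec was cp1252 and that every byte survived the wrong decode; the variable names are made up for the example.

```python
original = 'أبيض'

# Writing: the text is encoded to UTF-8 bytes.
utf8_bytes = original.encode('utf-8')
print(utf8_bytes)

# Reading with the wrong codec: each UTF-8 byte becomes a separate
# cp1252 character, which is the mojibake.
garbled = utf8_bytes.decode('cp1252')

# Reversing the mistake: re-encode with the wrong codec to get the
# original bytes back, then decode them as UTF-8.
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired == original)
```

    The reverse step only works when the wrong codec can represent every byte, so the real fix is still to pass the right encoding when opening the file.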

    Note that the name.encode('utf8') line in your writing example is entirely redundant; the return value of that call is ignored, and it is the f.write(name) call that takes care of the actual encoding. The f.close() call is also entirely redundant, since the with statement already takes care of closing your file. The following would produce the correct output:

    with open('test.txt', 'w', encoding='utf-8') as f:
        name = 'أبيض'
        f.write(name)
    
    with open('test.txt', 'r', encoding='utf-8') as f:
        for line in f.readlines():
            print(line)
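    If there is any doubt about whether the write or the read went wrong, one option is to inspect the raw bytes on disk. The sketch below writes to a temporary directory rather than the current one; that path is an assumption made to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical scratch location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('أبيض')

# Binary mode skips decoding entirely and returns the bytes as stored.
with open(path, 'rb') as f:
    raw = f.read()
print(raw)

# Decoding those bytes as UTF-8 recovers the original text.
text = raw.decode('utf-8')
print(text == 'أبيض')
```

    Eight bytes for four Arabic letters (two bytes each) suggests the file itself holds valid UTF-8, so only the reading side needs the encoding argument.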