Runtime Environment: Python 2.7, Windows 7
NOTE: I am talking about the encoding of the file generated by the Python code (NOT the Python source file's own encoding); the encoding declared at the top of the Python source file DID agree with the encoding in which that source file was saved.
When there are no non-ASCII characters in the string (content = 'abc'), the file (file.txt, NOT the Python source file) is saved in ANSI encoding after fp.close(). The Python source file (itself saved in ANSI) is as below:
## Author: melo
## Email: prevision@imsrch.tk
## Date: 2012/10/12

import os

def write_file(filepath, mode, content):
    try:
        fp = open(filepath, mode)
        try:
            print 'file encoding:', fp.encoding
            print 'file mode:', fp.mode
            print 'file closed?', fp.closed
            fp.write(content)
        finally:
            fp.close()
            print 'file closed?', fp.closed
    except IOError, e:
        print e

if __name__ == '__main__':
    filepath = os.path.join(os.getcwd(), 'file.txt')
    content = 'abc'
    write_file(filepath, 'wb', content)
But when there are non-ASCII characters in the string (content = 'abc莹'), the file (file.txt) is saved in UTF-8 encoding after fp.close(), even though I declared the encoding at the top of the Python source file (not file.txt) with # encoding=gbk. In this case the Python source file's content is as below:
# -*- encoding: gbk -*-
## Author: melo
## Email: prevision@imsrch.tk
## Date: 2012/10/12

import os

def write_file(filepath, mode, content):
    try:
        fp = open(filepath, mode)
        try:
            print 'file encoding:', fp.encoding
            print 'file mode:', fp.mode
            print 'file closed?', fp.closed
            fp.write(content)
        finally:
            fp.close()
            print 'file closed?', fp.closed
    except IOError, e:
        print e

if __name__ == '__main__':
    filepath = os.path.join(os.getcwd(), 'file.txt')
    content = 'abc莹'
    write_file(filepath, 'wb', content)
Is there any proof that it behaves like this?
A file is saved in the encoding you save it in. A source file is saved in the encoding you save it in. They don't have to be the same; they just should be declared.
Per your other question, I assume you are using Notepad++, and when you open file.txt you find that Notepad++ thinks the file is encoded in UTF-8 without BOM. This is an incorrect guess by Notepad++. Select the Chinese GB2312 character set and the file will display properly.
Unless given a hint by a byte order mark (BOM) or some other metadata or told by the user, programs have no idea what encoding a file is in.
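To see why detection is guesswork, here is a small sketch (not part of the original answer) showing that the very same bytes "decode" successfully under more than one codec, with only one interpretation being the intended text:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the raw GBK bytes of 'abc莹' carry no marker saying
# "this is GBK" -- a program reading them can only guess the encoding.
data = u'abc莹'.encode('gbk')        # 'abc' plus a two-byte GBK sequence
print(repr(data.decode('gbk')))      # the intended text
print(repr(data.decode('latin-1')))  # also decodes without error, but as mojibake
```

Because Latin-1 assigns a character to every possible byte, decoding with it never fails; that is exactly why "it decoded without an error" proves nothing about which encoding was used.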
A correct Python program would do these things: declare the encoding of the source file, use Unicode strings for text, and state an explicit encoding when opening the output file.
Example:
# encoding: utf-8
import codecs

with codecs.open('file.txt', 'wb', encoding='utf-8-sig') as f:
    f.write(u'abc莹')
You should now see in Notepad++ that file.txt is detected as encoded in 'UTF-8' (with BOM), and it will display properly.
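A quick way to confirm the BOM is really there is to read the file back in binary mode (a sketch, reusing the same file.txt path as above):

```python
# -*- coding: utf-8 -*-
# Sketch: 'utf-8-sig' prepends the 3-byte UTF-8 BOM (EF BB BF), which is
# exactly the hint Notepad++ needs to identify the encoding with certainty.
import codecs

with codecs.open('file.txt', 'wb', encoding='utf-8-sig') as f:
    f.write(u'abc莹')

with open('file.txt', 'rb') as f:
    raw = f.read()

print(raw[:3] == codecs.BOM_UTF8)  # True: the file starts with the BOM
```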
Note that you can save the file in 'ANSI' (GBK on your system) if you declare the encoding as gbk, and it will still work because Unicode strings were used.
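For completeness, that GBK variant might look like the sketch below; because f.write() is handed a Unicode string, codecs.open() can transcode it to whichever target encoding is declared:

```python
# -*- coding: utf-8 -*-
# Sketch: the same write, but transcoded to GBK instead of UTF-8.
import codecs

with codecs.open('file.txt', 'wb', encoding='gbk') as f:
    f.write(u'abc莹')

with open('file.txt', 'rb') as f:
    gbk_bytes = f.read()

print(repr(gbk_bytes))  # 'abc' in ASCII plus a two-byte GBK sequence
```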
Actually, your system probably uses code page 936 (cp936) rather than GBK; they aren't precisely the same. It is better to use a Unicode encoding like UTF-8 or UTF-16, which can represent all Unicode characters accurately.
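That advantage is easy to demonstrate (a sketch; the emoji is just an arbitrary character outside GBK's repertoire):

```python
# -*- coding: utf-8 -*-
# Sketch: a legacy code page raises an error for characters it cannot map,
# while UTF-8 and UTF-16 round-trip any Unicode text.
text = u'abc\U0001F600'  # includes a character GBK cannot represent

try:
    text.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

print(gbk_ok)  # False: gbk cannot encode this text
for enc in ('utf-8', 'utf-16'):
    assert text.encode(enc).decode(enc) == text
```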