Writing and then reading a string in a file encoded in Latin-1


Here are two Python 3 code samples. The first one writes two files with Latin-1 encoding:

s='On écrit ça dans un fichier.'
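# print() appends a trailing newline to the file.
with open('spam1.txt', 'w', encoding='ISO-8859-1') as f:
    print(s, file=f)
# f.write() writes the string exactly as given.
with open('spam2.txt', 'w', encoding='ISO-8859-1') as f:
    f.write(s)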

The second one reads the same files with the same encoding:

with open('spam1.txt', 'r', encoding='ISO-8859-1') as f:
    s1 = f.read()
with open('spam2.txt', 'r', encoding='ISO-8859-1') as f:
    s2 = f.read()

Now, when I print s1 and s2, I get

On Ã©crit Ã§a dans un fichier.

instead of the initial "On écrit ça dans un fichier."

What is wrong? I also tried io.open, but I am still missing something. The funny part is that I had no such problem with Python 2.7 and its str.decode method, which is now gone...

Could someone help me?


Solution

  • Your data was written out as UTF-8:

    >>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')
    'On Ã©crit Ã§a dans un fichier.'
    

    This means either that you did not write out Latin-1 data, or that your source code was saved as UTF-8 but you declared your script (using a PEP 263-compliant header) to be Latin-1 instead.

    If you saved your Python script with a header like:

    # -*- coding: latin-1 -*-
    

    but your text editor saved the file with UTF-8 encoding instead, then the string literal:

    s='On écrit ça dans un fichier.'
    

    will be misinterpreted by Python in the same manner. Saving the resulting Unicode string to disk as Latin-1 and then reading it back as Latin-1 preserves the error.

    To debug, take a close look at the output of print(s.encode('unicode_escape')) in the first script. If it looks like:

    b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'
    

    then your source code encoding and the PEP 263 header disagree on how the source code should be interpreted. If your source code is decoded correctly, the output is:

    b'On \\xe9crit \\xe7a dans un fichier.'
    

    If Spyder is stubbornly ignoring the PEP-263 header and reading your source as Latin-1 regardless, avoid using non-ASCII characters and use escape codes instead; either using \uxxxx unicode code points:

    s = 'On \u00e9crit \u00e7a dans un fichier.'
    

    or \xaa one-byte escape codes for code-points below 256:

    s = 'On \xe9crit \xe7a dans un fichier.'
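
As a rough illustration of the diagnosis above, here is a minimal sketch that reproduces the problem without relying on any editor. It assumes the mismatch is exactly "source saved as UTF-8, decoded as Latin-1" and simulates it with an encode/decode pair; the names original, misread, read_back and repaired, and the temporary file spam.txt, are only illustrative.

```python
import os
import tempfile

# The intended text, written with escapes so this file is pure ASCII.
original = 'On \u00e9crit \u00e7a dans un fichier.'

# Assumption: the source was saved as UTF-8 but decoded as Latin-1.
# Encoding to UTF-8 and decoding as Latin-1 simulates that mismatch.
misread = original.encode('utf-8').decode('latin-1')

# The debugging check: the escapes show what the string really holds.
print(misread.encode('unicode_escape'))   # b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'
print(original.encode('unicode_escape'))  # b'On \\xe9crit \\xe7a dans un fichier.'

# Writing and reading back as Latin-1 returns the string unchanged,
# so the file I/O preserves whatever the string already contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam.txt')
    with open(path, 'w', encoding='ISO-8859-1') as f:
        f.write(misread)
    with open(path, 'r', encoding='ISO-8859-1') as f:
        read_back = f.read()

# Reversing the misinterpretation recovers the intended text.
repaired = read_back.encode('latin-1').decode('utf-8')
print(repaired == original)  # True
```

If the first print matches the \xc3\xa9 form for your own s, the string was already wrong before it reached open(), and the fix belongs in the editor's save encoding or the coding header rather than in the file-handling code.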