Tags: python, encoding, utf-8, decode, iso-8859-1

Replace iso-8859-1 encoding symbols in Python's open function


How can I decode iso-8859-1 symbols within the open function?

filename = open(f'/opt/PATH/{shorter}', 'r', encoding='iso-8859-1')
file_content = filename.read()
filename.close()

This gave me ÿ (I guess this was originally a comma):

[...]
11 Dir(s) 3ÿ016ÿ011ÿ776 bytes free
[...]

Solution

  • It's a mojibake case:

    cmd

    >NUL chcp 852
    >dir_cp852.txt dir /C
    type dir_cp852.txt | find /I "bytes free"
    
                  28 Dir(s)  832 467 206 144 bytes free
    
    >NUL chcp 1252
    type dir_cp852.txt | find /I "bytes free"
    
                  28 Dir(s)  832ÿ467ÿ206ÿ144 bytes free
    

    Python

    with open('dir_cp852.txt', 'r', encoding='iso-8859-1') as filename:
        file_content = filename.read()
    
    print(file_content[-52:])
    
                  28 Dir(s)  832ÿ467ÿ206ÿ144 bytes free
    

    Solution:

    with open('dir_cp852.txt', 'r', encoding='cp852') as filename:
        file_content = filename.read()
    
    print(file_content[-52:])
    
                  28 Dir(s)  832 467 206 144 bytes free
    

    Note what file_content[-52:] evaluates to at the Python prompt:

    '              28 Dir(s)  832\xa0467\xa0206\xa0144 bytes free\n'
    

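    This shows the character behind the mojibake: \xa0 (U+00A0, No-Break Space), which has byte code 0xFF in code page 852 (and in several other MS-DOS code pages). Decoded as iso-8859-1, the same byte 0xFF becomes ÿ (U+00FF).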


    Please note the /C switch in dir /C above (display the thousand separator in file sizes). I used it explicitly because I have overridden the default with a globally defined set "DIRCMD=/-C".

    The thousand separator in file sizes is defined in Control Panel\Clock and Region -> Region:
    reg query "HKCU\Control Panel\International" /v sThousand
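
The byte-level explanation above can be checked directly in Python without the file. This is a minimal sketch, assuming the separator byte is 0xFF as in the cp852 example; the sample byte string and the `bytes_free` helper are hypothetical and only for illustration.

```python
# Sample bytes as `dir /C` would write them under code page 852:
# the thousand separator is the single byte 0xFF (assumed, as in the answer).
raw = b"28 Dir(s)  832\xff467\xff206\xff144 bytes free"

wrong = raw.decode("iso-8859-1")  # 0xFF -> U+00FF (ÿ), the mojibake
right = raw.decode("cp852")       # 0xFF -> U+00A0 (no-break space)

print(ascii(wrong))
print(ascii(right))

# If only the mis-decoded string is available, it can be repaired:
# iso-8859-1 maps every byte to exactly one code point, so encoding
# back to bytes loses nothing.
repaired = wrong.encode("iso-8859-1").decode("cp852")


def bytes_free(line: str) -> int:
    """Hypothetical helper: read the free-bytes count from a summary line,
    ignoring whatever thousand separator was used."""
    after_dirs = line.split("Dir(s)", 1)[1]
    return int("".join(ch for ch in after_dirs if ch.isdigit()))


print(bytes_free(repaired))
```

Keeping only the digits avoids depending on the separator at all, which helps when the files come from machines with different regional settings.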