Tags: python, python-3.x, windows, utf-8, character-encoding

Editing UTF-8 text file on Windows


I'm trying to manipulate a text file of song names. I want to clean up the data by changing all spaces and tabs into +.

This is the code:

input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()

It gives me an error:

Traceback (most recent call last):
  File "music.py", line 3, in <module>
    for line in input:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>

As music.txt is saved in UTF-8, I changed the first line to:

input = open('music.txt', 'r', encoding="utf8")

This gives another error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>

I tried other things with the out.write() but it didn't work.

This is the raw data of music.txt: https://pastebin.com/FVsVinqW

I saved it in the Windows editor as a UTF-8 .txt file.


Solution

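  • If your system's default encoding is not UTF-8, you need to pass the encoding explicitly for both file handles you open, on Python versions for Windows that do not use UTF-8 by default. Without an encoding argument, open() falls back to the locale's code page (cp1252 in your traceback). Your second error comes from out.txt: the input was now read as UTF-8, but the output file was still opened with cp1252, which cannot encode '\u039b' (Λ).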

    with open('music.txt', 'r', encoding='utf-8') as infh,\
            open("out.txt", "w", encoding='utf-8') as outfh:
        for line in infh:
            line = line.replace(" ", "+").replace("\t", "+")
            outfh.write(line)
    

    This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
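
If more characters ever need the same treatment, the chained replace() calls could be folded into a single translation table. This is only a sketch of that idea: the helper name plus_separate and the throwaway test file are my own inventions for illustration, not part of the question.

```python
import os
import tempfile

# Translation table mapping both space and tab to "+".
TO_PLUS = str.maketrans({" ": "+", "\t": "+"})

def plus_separate(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, replacing spaces and tabs with '+'.

    Hypothetical helper; both files are opened with an explicit encoding
    so the result does not depend on the system default.
    """
    with open(src_path, "r", encoding=encoding) as infh, \
            open(dst_path, "w", encoding=encoding) as outfh:
        for line in infh:
            outfh.write(line.translate(TO_PLUS))

# Self-check on a temporary file that contains a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "music.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w", encoding="utf-8") as fh:
        fh.write("\u039b song\tname\n")
    plus_separate(src, dst)
    with open(dst, encoding="utf-8") as fh:
        result = fh.read()

print(result)
```

For just two characters the chained replace() version is arguably just as readable; the table mainly pays off when the list of characters grows.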

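    Going forward, upgrading Python may eventually make this unnecessary. My understanding is that Python 3.7 and later offer an opt-in UTF-8 mode (python -X utf8, or the PYTHONUTF8=1 environment variable), and that PEP 686 makes UTF-8 mode the default from Python 3.15. On versions before that, open() on Windows still defaults to the locale encoding, so passing encoding='utf-8' explicitly remains the portable choice.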
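
To see which encoding your own interpreter falls back to, and why cp1252 produced both errors in the question, something like the following may help. It is a diagnostic sketch only: it touches no files, and the variable names are mine.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= is passed.
default_encoding = locale.getpreferredencoding(False)
print("default encoding for open():", default_encoding)
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))  # Python 3.7+

# Reproduce the first error: byte 0x81 is unassigned in cp1252.
try:
    b"\x81".decode("cp1252")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Reproduce the second error: Greek capital lambda is not in cp1252.
try:
    "\u039b".encode("cp1252")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# The same character encodes without trouble as UTF-8.
lambda_utf8 = "\u039b".encode("utf-8")
print(decode_failed, encode_failed, lambda_utf8)
```

If the first line reports cp1252 (or another legacy code page), every open() call without an explicit encoding is exposed to the same pair of errors.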