Python 3 CSV file giving UnicodeDecodeError: 'utf-8' codec can't decode byte error when I print


I have the following code in Python 3, which is meant to print out each line in a csv file.

import csv
with open('my_file.csv', 'r', newline='') as csvfile:
    lines = csv.reader(csvfile, delimiter=',', quotechar='|')
    for line in lines:
        print(' '.join(line))

But when I run it, it gives me this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

I looked through the csv file, and it turns out that if I take out a single ñ (little n with a tilde on top), every line prints out fine.

My problem is that I've looked through a bunch of different solutions to similar problems, but I still have no idea how to fix this, what to decode/encode, etc. Simply taking out the ñ character in the data is NOT an option.


Solution

  • We know the file contains the byte b'\x96' since it is mentioned in the error message:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte
    

    Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

    import encodings
    import encodings.aliases
    import os
    import pkgutil

    def all_encodings():
        """Return the names of all codec modules and alias targets Python ships."""
        # Each codec is a module inside the encodings package directory.
        modnames = {modname for _, modname, _ in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')}
        aliases = set(encodings.aliases.aliases.values())
        return modnames | aliases

    text = b'\x96'
    for enc in all_encodings():
        try:
            msg = text.decode(enc)
        except Exception:
            # Skip non-text codecs and encodings that cannot decode this byte.
            continue
        if msg == 'ñ':
            print('Decoding {t} with {enc} is {m}'.format(t=text, enc=enc, m=msg))
    

    which yields

    Decoding b'\x96' with mac_roman is ñ
    Decoding b'\x96' with mac_farsi is ñ
    Decoding b'\x96' with mac_croatian is ñ
    Decoding b'\x96' with mac_arabic is ñ
    Decoding b'\x96' with mac_romanian is ñ
    Decoding b'\x96' with mac_iceland is ñ
    Decoding b'\x96' with mac_turkish is ñ
    

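    The original open() call passes no encoding, so Python falls back to the locale's default (UTF-8 here, judging by the error message), which cannot decode this byte. Therefore, try changing

    with open('my_file.csv', 'r', newline='') as csvfile:
    

    to name one of those encodings explicitly, such as: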

    with open('my_file.csv', 'r', encoding='mac_roman', newline='') as csvfile:
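
To see the fix end to end, here is a minimal sketch. It assumes the file really is Mac Roman, which the byte check above suggests but does not prove. Since the real my_file.csv is not available, the sketch writes a hypothetical one-row stand-in to a temporary file; the read_rows helper and the sample row are illustrative names, not part of the original question.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Read every CSV row from path, decoding the file with the given encoding."""
    with open(path, 'r', encoding=encoding, newline='') as csvfile:
        return list(csv.reader(csvfile, delimiter=',', quotechar='|'))


# Hypothetical stand-in for my_file.csv: one row stored as Mac Roman bytes,
# so each ñ is written as the single byte 0x96.
raw = '1,Señor,Peña\n'.encode('mac_roman')
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

try:
    # Decoding as UTF-8 reproduces the error from the question.
    try:
        read_rows(path, 'utf-8')
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True  # 0x96 is not a valid UTF-8 start byte

    # Decoding as Mac Roman recovers the ñ characters intact.
    rows = read_rows(path, 'mac_roman')
finally:
    os.remove(path)
```

Each entry in rows can then be printed with print(' '.join(row)), exactly as in the question. If some other character in the real file comes out wrong, that hints the file uses a different encoding from the candidate list, so it is worth spot-checking a few known non-ASCII values after switching.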