Tags: python, python-3.x, csv, encoding, decoding

Using the right encoding for csv file in Python 3


I wrote a function that takes a single argument, file, which is a large .csv file. For one specific file (its contents are in German), I get the following error as soon as I call the function:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 4: invalid continuation byte

The system's default encoding is utf-8, but if I call open('C:/Users/me/Desktop/data/myfile.csv') without an encoding argument, the result is:

<_io.TextIOWrapper name='C:/Users/me/Desktop/data/myfile.csv' mode='r' encoding='cp1252'>

Calling file.decode('cp1252').encode('utf8') fails with "'str' object has no attribute 'decode'", so I tried:

for decodedLine in open('C:/Users/me/Desktop/data/myfile.csv', 'r', encoding='cp1252'):
    line = decodedLine.split('\t')

but line is a list, so I can't call .encode() on it.

How can I read .csv files that use a different encoding?


Solution

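  • If I understand correctly, you have a csv file encoded as cp1252. If that is the case, all you have to do is open the file with that encoding. In Python 3, text read from a file opened this way is already decoded into str (Unicode), so there is nothing left to .decode() or .encode(). For the csv parsing itself, I would use the csv module from the standard library rather than splitting lines by hand. Alternatively, you may want to look into a more specialized library like pandas.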

    To parse your csv, something like this should do:

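    import csv
    
    # path from the question; point this at your own file
    filepath = 'C:/Users/me/Desktop/data/myfile.csv'
    
    # newline='' is what the csv docs recommend for files handed to the csv module
    with open(filepath, 'r', encoding='cp1252', newline='') as file_obj:
        # adjust the parameters according to your file, see docs for more
        csv_obj = csv.reader(file_obj, delimiter='\t', quotechar='"')
        for row in csv_obj:
            # row is a list of str entries, already decoded
            # this prints the entries of each row, separated by commas
            print(', '.join(row))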
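
    If what you were after with .decode('cp1252').encode('utf8') is a UTF-8 copy of the file, here is a minimal sketch of one way to do that. The helper convert_to_utf8, the temporary file names and the sample rows are all made up for illustration, and it assumes the source really is cp1252:

```python
import csv
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding='cp1252'):
    """Copy a text file, decoding it as src_encoding and writing it as UTF-8.

    Hypothetical helper for illustration, not part of any library.
    """
    # newline='' on both sides passes line endings through unchanged
    with open(src_path, 'r', encoding=src_encoding, newline='') as src, \
            open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)


# Demo on a throwaway file so the snippet runs anywhere.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'german_cp1252.csv')
dst_path = os.path.join(tmp_dir, 'german_utf8.csv')

# In cp1252, 'ä' is the single byte 0xe4, the byte named in the error message.
sample_text = 'Name\tStadt\nJörg\tMünchen\nKäthe\tKöln\n'
with open(src_path, 'wb') as f:
    f.write(sample_text.encode('cp1252'))

convert_to_utf8(src_path, dst_path)

# The copy can now be read with the utf-8 codec.
with open(dst_path, 'r', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t', quotechar='"'))
print(rows)
```

    The conversion works line by line, so a big file never has to fit in memory. If the source is not actually cp1252, the copy may raise a UnicodeDecodeError or silently contain wrong characters, so it is worth checking a few lines with umlauts afterwards.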