I wrote a function that takes one argument, file, which is a large .csv document. For one specific file (its contents are in German), I get the following error immediately after calling the function:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 4: invalid continuation byte
The system's default encoding is utf-8, but if I call open('C:/Users/me/Desktop/data/myfile.csv'), the output is:
<_io.TextIOWrapper name='C:/Users/me/Desktop/data/myfile.csv' mode='r' encoding='cp1252'>
Using file.decode('cp1252').encode('utf8') doesn't work, since 'str' object has no attribute 'decode', so I tried:
for decodedLine in open('C:/Users/me/Desktop/data/myfile.csv', 'r', encoding='cp1252'):
    line = decodedLine.split('\t')
but line is a list object, so I can't call .encode() on it.
How can I make .csv files that have a different encoding readable?
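If I understand correctly, you have a csv file encoded as cp1252.
If that is the case, all you have to do is open the file with the right encoding. In Python 3, the strings you read from it are then already decoded text, so there is no need to call .decode() or .encode() on them afterwards.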
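To illustrate where the error comes from, here is a minimal sketch. The sample text is made up for illustration and is not taken from your file; it is only chosen so that the byte 0xe4 (the letter ä in cp1252) lands at position 4, as in your traceback.

```python
# Hypothetical sample line, not taken from the real file: "Verkäufer" puts
# the cp1252 byte 0xe4 ('ä') at index 4, as in the error message.
raw = 'Verkäufer\tPreis'.encode('cp1252')

try:
    raw.decode('utf-8')
    error_position = None
except UnicodeDecodeError as exc:
    # 0xe4 announces a multi-byte UTF-8 sequence, but the next byte ('u')
    # is not a valid continuation byte, so decoding stops here.
    error_position = exc.start

# Decoding with the encoding the file was actually written in works.
text = raw.decode('cp1252')
fields = text.split('\t')  # a list of str; nothing left to encode or decode
print(error_position, fields)
```

The same thing happens inside open(): the encoding argument only tells Python how to turn the bytes on disk into str objects.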
For parsing the csv itself, I would use the csv module from the standard library.
Alternatively, you may want to look into a more specialized library like pandas.
In any case, to parse your csv you could simply do:
import csv

# newline='' is what the csv docs recommend, so the reader handles line endings itself
with open(filepath, 'r', encoding='cp1252', newline='') as file_obj:
    # adjust the parameters according to your file, see docs for more
    csv_obj = csv.reader(file_obj, delimiter='\t', quotechar='"')
    for row in csv_obj:
        # row is a list of entries (already decoded str objects)
        # this would print all entries, separated by commas
        print(', '.join(row))
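If what you actually want is a UTF-8 copy of the file (which is what your .decode('cp1252').encode('utf8') attempt seems to be aiming at), a rough sketch could look like the following. The function name convert_to_utf8, the temporary file names, and the sample rows are all made up for this example; it also assumes the whole file really is cp1252.

```python
import csv
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding='cp1252'):
    """Copy a text file line by line, re-encoding it as UTF-8."""
    with open(src_path, 'r', encoding=src_encoding, newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)


# Demo with a small, made-up cp1252 file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'sample_cp1252.csv')
dst_path = os.path.join(tmp_dir, 'sample_utf8.csv')

with open(src_path, 'wb') as f:
    f.write('Verkäufer\tPreis\nMüller\t3,50\n'.encode('cp1252'))

convert_to_utf8(src_path, dst_path)

# The copy can now be read with the utf-8 codec.
with open(dst_path, 'r', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t', quotechar='"'))

print(rows)
```

Reading line by line keeps memory use low for a big file. For a one-off read, though, passing encoding='cp1252' to open() as shown above is enough and no conversion is needed.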