Tags: python, python-3.x, unicode, unicode-normalization

How to solve UnicodeDecodeError when reading a file with Danish characters?


I have read through similar questions on Stack Overflow, but none of them solves the Unicode problem I have: 'ascii' codec can't decode byte 0xc3 in position 302.

I have tried:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

However, I receive an error: NameError: name 'reload' is not defined

I am trying to read a file containing the Danish vowels æ, ø, å. In return I receive "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 302", and so on. Position 302 and the positions after it contain Danish vowels. Is there a way to fix this?

So far I have tried putting a specially formatted comment as the first line of the source code: # -*- coding: <ascii> -*-. It had no effect.

I also tried: f = open(fname, encoding="ascii", errors="surrogateescape"). But instead of reading the file with the characters as they are, for example in the word "Europæiske", I get "Europ\udcc3\udca6iske".

Then I tried a suggestion from a blog (I have lost the link) to "import unicodedata"; however, it was not well explained where to take it from there.

import unicodedata
import csv

with open('File.csv') as f:
  reader = csv.reader(f)
  for row in reader:
    print(row)

Solution

  • Simply open the file with the correct encoding. You have to know the encoding that the file was saved in. On Western versions of Windows it might be Windows-1252, or perhaps UTF-8. Modules such as chardet can make an educated guess. Also, for the csv module, open with newline='' as well (see the documentation for csv.reader):

    import csv
    
    with open('File.csv', encoding='utf8', newline='') as f:
      reader = csv.reader(f)
      for row in reader:
        print(row)
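
    The byte 0xc3 in the error message is a hint that the file is probably UTF-8: 'æ' is stored there as the two bytes 0xc3 0xa6, which is also where the "\udcc3\udca6" from the surrogateescape attempt comes from. If the encoding is not known for certain, one possible approach is to try a short list of candidates in order. The sketch below assumes the file is either UTF-8 or Windows-1252; read_rows and its encodings parameter are hypothetical names used only for illustration, not part of the csv module.

```python
import csv

# 'æ' is two bytes in UTF-8; 0xc3 is the first of them, which is the byte
# the 'ascii' codec refuses to decode.
assert "æ".encode("utf-8") == b"\xc3\xa6"

# Decoding those bytes as ASCII with surrogateescape does not fail, but it
# turns each non-ASCII byte into a lone surrogate instead of the real letter.
assert (
    b"Europ\xc3\xa6iske".decode("ascii", errors="surrogateescape")
    == "Europ\udcc3\udca6iske"
)


def read_rows(path, encodings=("utf-8", "cp1252")):
    """Return all CSV rows, trying each candidate encoding in order.

    Hypothetical helper. UTF-8 comes first because it is strict and rejects
    most text saved in other encodings, while cp1252 accepts almost any byte
    sequence and would hide a wrong guess.
    """
    for encoding in encodings:
        try:
            with open(path, encoding=encoding, newline="") as f:
                # Decoding happens while reading, so consume the whole file
                # inside the try block.
                return list(csv.reader(f))
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path!r}")
```

    Calling read_rows('File.csv') would then return a list of rows with the Danish characters intact, provided one of the candidate encodings really is the one the file was saved in. A successful decode is not proof of the right encoding, so it is still worth checking a few rows by eye.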