I have the following code:
import csv

with open(filename, 'rt') as csvfile:
    csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in csvDictReader:
        print(row)
Whenever the file size is less than 40k bytes, the program works great. When the file size crosses 40k, I get this error while trying to read the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7206: invalid start byte
The actual file content doesn't seem to be a problem, only the size of the file itself (40k bytes is really tiny).
When the file size is greater than 40K bytes, the error always happens on the line that contains the 32K-th byte.
I have a feeling that Python fails to read a file larger than 40K bytes without raising an exception, and just truncates it somewhere around the 32K-th byte, right in the middle. Is that correct? Where is this limit defined?
You have invalid UTF-8 data in your file. This has nothing to do with the csv module, nor with the size of the file; your larger file has invalid data in it, your smaller file does not. Simply doing:
with open(filename) as f:
    f.read()
should trigger the same error, and it's purely a matter of encountering an invalid UTF-8 byte, which indicates your file either wasn't UTF-8 to start with, or has been corrupted in some way.
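If you want to pin down exactly which byte is bad, read the file in binary mode and decode it in one go; a minimal sketch, assuming the same filename as in your snippet. As a side note, the offset in your traceback is reported relative to the internal chunk the text layer happened to be decoding, not to the start of the file, which is why it looks smaller than the ~32K position you observed.
# Minimal sketch: read the raw bytes and find the exact offset where UTF-8
# decoding breaks (assumes `filename` from the question).
with open(filename, 'rb') as f:   # binary mode: nothing is decoded here
    data = f.read()

try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                                        # full error message
    print(hex(data[e.start]))                       # the offending byte, e.g. 0xa0
    print(data[max(0, e.start - 20):e.start + 20])  # surrounding raw bytes for context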
If your file is actually a different encoding (e.g. latin-1, cp1252, etc.; the file command line utility might help with identification, but for many ASCII superset encodings you just have to know), pass that as the encoding argument to open to use instead of the locale default (utf-8 in this case), so you can decode the bytes properly, e.g.:
import csv

# Also add newline='' to defer newline processing to the csv module, where it's
# part of the CSV dialect
with open(filename, encoding='latin-1', newline='') as csvfile:
    csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in csvDictReader:
        print(row)
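As a sanity check on the guess: the byte from your traceback, 0xa0, is a non-breaking space in both latin-1 and cp1252, which is consistent with a file written by a tool that doesn't produce UTF-8. Also bear in mind that latin-1 maps every possible byte to a character, so decoding with it will never raise an error even if it's the wrong guess, whereas cp1252 leaves a few byte values undefined and can still fail on genuinely corrupt data.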