I have the following code:
import csv

with open(filename, 'rt') as csvfile:
    csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in csvDictReader:
        print(row)
Whenever the file size is less than 40k bytes, the program works great. When the file size crosses 40k, I get this error while trying to read the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7206: invalid start byte
The actual file content doesn't seem to be a problem, only the size of the file itself (40k bytes is really tiny).
When the file size is greater than 40K bytes, the error always happens on the line that contains the 32K-th byte.
I have a feeling that Python fails to read a file larger than 40K bytes without raising an exception, and just truncates it somewhere around the 32K-th byte, right in the middle. Is that correct? Where is this limit defined?
You have invalid UTF-8 data in your file. This has nothing to do with the csv module, nor with the size of the file; your larger file has invalid data in it, your smaller file does not. Simply doing:
with open(filename) as f:
    f.read()
should trigger the same error, and it's purely a matter of encountering an invalid UTF-8 byte, which indicates your file either wasn't UTF-8 to start with, or has been corrupted in some way.
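If you want to pin down exactly which byte is bad, read the file in binary mode and decode it in one go; a minimal sketch, assuming the same filename as in your snippet. As a side note, the offset in your traceback is reported relative to the internal chunk the text layer happened to be decoding, not to the start of the file, which is why it looks smaller than the ~32K position you observed.
# Minimal sketch: read the raw bytes and find the exact offset where UTF-8
# decoding breaks (assumes `filename` from the question).
with open(filename, 'rb') as f:   # binary mode: nothing is decoded here
    data = f.read()

try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                                        # full error message
    print(hex(data[e.start]))                       # the offending byte, e.g. 0xa0
    print(data[max(0, e.start - 20):e.start + 20])  # surrounding raw bytes for context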
If your file is actually a different encoding (e.g. latin-1, cp1252, etc.; the file command line utility might help with identification, but for many ASCII superset encodings you just have to know), pass that as the encoding argument to open to use instead of the locale default (utf-8 in this case), so you can decode the bytes properly, e.g.:
import csv

# Also add newline='' to defer newline processing to the csv module, where it's
# part of the CSV dialect
with open(filename, encoding='latin-1', newline='') as csvfile:
    csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in csvDictReader:
        print(row)
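As a sanity check on the guess: the byte from your traceback, 0xa0, is a non-breaking space in both latin-1 and cp1252, which is consistent with a file written by a tool that doesn't produce UTF-8. Also bear in mind that latin-1 maps every possible byte to a character, so decoding with it will never raise an error even if it's the wrong guess, whereas cp1252 leaves a few byte values undefined and can still fail on genuinely corrupt data.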