Not understanding given byte and position in UnicodeDecodeError error message


I have a csv file which looks like this:

Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;Thisisatext;
VESELÝ

That is exactly 40 rows of repeated "Thisisatext;" fields, followed by a last line (line 41) that contains a name with an accented Y (Ý).

I tried to open this file with the following Python code:

import csv

myfile = r'S:\folder\myfile.csv'

# No encoding is passed here, so the platform default is used.
# Also tried: encoding="utf-8-sig" and encoding="cp1252"
with open(myfile) as infile:
    cr = csv.reader(infile, delimiter=";")
    for line in cr:
        print(line)

This does not work and raises an error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 57: character maps to <undefined>

And indeed, the fix is to specify the encoding utf-8-sig. The error appears because cp1252 seems to be the default encoding on my operating system. When I explicitly use cp1252, I get the same error. When I use utf-8-sig, it works. I assume this is because the accented Y can be represented in UTF-8 but apparently not in cp1252.

When I open the file in Notepad++ and place the cursor in line 41 where the Ý stands, it reports position 8248. When I select the Ý and use the Notepad++ converter to convert ASCII to hex, it displays:

[Image: Notepad++ converter output showing C39D]

The python error message says can't decode byte 0x9d in position 57.

When I remove one line (for example the first one), Notepad++ reports the Ý at position 8042. When I run the code on the new file, I get the error message:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 8043: character maps to <undefined>

So now the position is 8043, which is (almost) what Notepad++ reports.

I have two questions:

First: why does the Python error message mention byte 0x9d? The converted result in Notepad++ is C39D, as shown in the image. The 0x prefix tells me that Python is talking about hex, and when I look up 9D here, it says:

"NOTE: Decimal 128-129 (0x80-81), 141-144 (0x8D-90) and 157-158 (0x9D-9E) are non-printable characters."

But why is the Notepad++ converted result C39D? According to this table, C3 would be î.

My guess is that 9D is an accent that can be added to different characters and that this is the problem, but I do not understand the details.

Second: why and how does Python calculate the position? There is clearly a big mismatch in the first case, where Python says 'charmap' codec can't decode byte 0x9d in position 57. How does Python come up with 57? And when I remove one line, the position is suddenly correct, with a completely different number. It seems that Python reads the file in parts and reports the position within a specific part. Is it possible to get the "real" position? If I have a very large file and cannot search it manually, how can I identify the exact position of the character causing the problem?


Solution

  • For the second question: if you run the code in a debugger, you'll see that it raises an exception (not exactly the same as yours, but close):

    Traceback (most recent call last):
      File "C:\test.py", line 5, in <module>
        for line in cr:
      File "D:\dev\Python311\Lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 54: character maps to <undefined>
    

    In my debugger, the value of input is:

    [Dbg]>>> input
    b'isisatext;Thisisatext;Thisisatext;Thisisatext;\r\nVESEL\xc3\x9d\r\n'
    [Dbg]>>> hex(input[54])
    '0x9d'
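
The same offset can be reproduced without a debugger. The sketch below is only an illustration: it rebuilds the sample file in memory, assuming 40 rows of 17 fields with CRLF line endings and UTF-8 text, and assumes the decoder is handed 8192 bytes at a time. The names `ROW`, `CHUNK` and `first_bad_byte` are made up for this example.

```python
# Rebuild the sample file in memory (layout assumed from the question).
ROW = b"Thisisatext;" * 17 + b"\r\n"   # 204 bytes of fields plus CRLF
data = ROW * 40 + "VESELÝ".encode("utf-8") + b"\r\n"

CHUNK = 8192  # assumed read size, not something the question confirms


def first_bad_byte(chunk, encoding="cp1252"):
    """Return the offset of the first undecodable byte in chunk, or None."""
    try:
        chunk.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start  # offset relative to the bytes that were decoded
    return None


second_chunk = data[CHUNK:2 * CHUNK]
relative = first_bad_byte(second_chunk)
print(relative)          # 54, the offset inside the second chunk
print(CHUNK + relative)  # 8246, the offset inside the whole file

# Guess, not confirmed by the question: a 3-byte UTF-8 BOM at the start
# of the file would shift every offset by 3.
with_bom = b"\xef\xbb\xbf" + data
print(first_bad_byte(with_bom[CHUNK:2 * CHUNK]))  # 57

# Same guess, with the first row removed: the file now fits in one chunk.
one_row_less = b"\xef\xbb\xbf" + data[len(ROW):]
print(first_bad_byte(one_row_less[:CHUNK]))       # 8043
```

If the file in the question starts with a UTF-8 byte order mark, the same arithmetic gives 57 and 8043, the two positions reported there. That would be consistent with utf-8-sig working, although that codec also accepts files without a BOM, so this remains a guess.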
    

    The takeaway is that the offset is relative to the chunk of the file currently being decoded, not to the start of the file. Since the file is slightly larger than 8K and the offset changes to slightly less than 8K when a line is deleted, the file is likely read in 8K chunks. To find the true location in the file, read it all at once and decode it as one continuous blob of data:

    >>> with open('myfile.csv', 'rb') as f:
    ...     data = f.read()
    ...     
    >>> data.decode('cp1252')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
      File "D:\dev\Python311\Lib\encodings\cp1252.py", line 15, in decode
        return codecs.charmap_decode(input,errors,decoding_table)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 8246: character maps to <undefined>
    >>> hex(data[8246])
    '0x9d'
    

    8246 is the actual offset of byte 0x9D in the file.
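
For a file too large to read in one go, the same idea can be applied chunk by chunk while keeping a running byte count. This is a minimal sketch, not a tested tool: `find_undecodable` and its parameters are hypothetical names. The approach relies on cp1252 being a single-byte codec, so every byte can be judged on its own regardless of where a chunk boundary falls.

```python
import os
import tempfile


def find_undecodable(path, encoding="cp1252", chunk_size=1 << 16):
    """Yield (absolute_offset, byte_value) for each byte `encoding` rejects.

    Only meant for single-byte codecs, where each byte decodes on its own.
    """
    offset = 0  # absolute offset of the current chunk within the file
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            pos = 0
            while pos < len(chunk):
                try:
                    chunk[pos:].decode(encoding)
                    break  # the rest of this chunk decodes cleanly
                except UnicodeDecodeError as exc:
                    bad = pos + exc.start  # exc.start is relative to chunk[pos:]
                    yield offset + bad, chunk[bad]
                    pos = bad + 1  # resume just after the bad byte
            offset += len(chunk)


# Demo on a temporary copy of the sample data (layout assumed from the question).
row = b"Thisisatext;" * 17 + b"\r\n"
sample = row * 40 + "VESELÝ".encode("utf-8") + b"\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
hits = list(find_undecodable(tmp.name, chunk_size=8192))
os.remove(tmp.name)
print([(off, hex(b)) for off, b in hits])  # [(8246, '0x9d')]
```

For multi-byte encodings such as UTF-8, a chunk boundary can split a character, so this byte-by-byte approach would not carry over directly. An incremental decoder such as the one returned by `codecs.getincrementaldecoder` would be a more suitable starting point there.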

    For the first question, C3 9D together form the two-byte UTF-8 encoding of Ý. In cp1252, C3 decodes as Ã but 9D is undefined. Decoding up to but not including offset 8246 works without error, which is why the exception is raised on 9D:

    >>> data[8240:8246].decode('cp1252')
    'VESELÃ'
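
The byte-level behaviour can also be checked directly, independent of any file. The snippet below is just an illustration of the point above. The variable names are arbitrary, and the last line adds one detail not stated in the answer: cp1252 does contain Ý, only under a different byte value.

```python
encoded = "Ý".encode("utf-8")
print(encoded)                   # b'\xc3\x9d', the C39D shown by Notepad++

# The first byte on its own is a valid cp1252 character.
print(b"\xc3".decode("cp1252"))  # Ã

# The second byte has no mapping in Python's cp1252 table.
try:
    b"\x9d".decode("cp1252")
    second_ok = True
except UnicodeDecodeError:
    second_ok = False
print(second_ok)                 # False

# Ý does exist in cp1252, but as the single byte 0xDD, not as C3 9D.
print("Ý".encode("cp1252"))      # b'\xdd'
```

So the problem is not that cp1252 lacks the character, but that the file stores it in its UTF-8 form, which cp1252 reads as two unrelated bytes.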