python, utf-8, dna-sequence

Which Python encoding must I use to read a non-UTF-8 character?


I need my Python script to read a file of DNA query strings and run a search with it.

Well, the file contains this type of character:

Screenshot

Python's default encoding cannot decode this line when I call readline() on the file. The following error is raised:

[...]
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 860: invalid start byte

I have also tried utf_16 and ascii, but neither worked. How can I read this file?


Solution

  • You need to first figure out what the actual encoding of the text file you have to read, then use open with that file and the correct encoding argument to open that. The diamond ? is simply a placeholder character in your console so your default system encoding is incompatible with the file you have displayed (and vice versa).

    Alternatively, if you do not care about the "junk" characters, you can simply pass 'ignore' or 'replace' as the errors argument to open. Again, please consult the documentation first for the available options.
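
Both options can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample bytes, the temporary file, and the helper read_first_line are made up for the demo, and "latin-1" is a stand-in for whatever encoding the real file turns out to use (it is chosen here only because it maps every byte to a code point and therefore never raises).

```python
import os
import tempfile

# Hypothetical sample data: a DNA query line containing a stray 0x81 byte,
# the same byte value the traceback in the question complains about.
raw = b"ACGT\x81TTGA\n"

# Write the bytes to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name


def read_first_line(path, encoding="utf-8", errors="strict"):
    """Return the first line of the file, decoded with the given settings."""
    with open(path, encoding=encoding, errors=errors) as handle:
        return handle.readline()


# Strict UTF-8 decoding reproduces the failure from the question.
try:
    read_first_line(path)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 1: pass the file's real encoding. "latin-1" is an assumption,
# not a diagnosis of the asker's file.
as_latin1 = read_first_line(path, encoding="latin-1")

# Option 2: keep UTF-8 but drop or substitute the undecodable bytes.
ignored = read_first_line(path, errors="ignore")
replaced = read_first_line(path, errors="replace")

os.remove(path)
```

With errors="ignore" the bad byte silently disappears, while errors="replace" leaves a U+FFFD marker in its place. For sequence data the second is usually easier to audit, since the positions of damaged bytes stay visible.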