python encoding utf-8 character-encoding

How to open a file that contains bytes that are not valid UTF-8?

I want to open a text file (.dat) in Python and I get the following error: 'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte. The file is supposed to be encoded in UTF-8, so maybe there are some characters that cannot be read. Is there a way to handle the problem without tracking down each weird character individually? I have a rather huge text file, and it would take me hours to find every character that is not valid UTF-8.

Here is my code

import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()

Solution

It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is: index 4484 in your input, with a value of 0x92. If you did:

with open('compounds.dat', 'rb') as f:
    data = f.read()

the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
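
If you would rather not rely on the offset in the traceback, you can ask the decoder itself. This is a minimal sketch: find_bad_byte is a hypothetical helper name, and the sample bytes stand in for the contents of your file.

```python
def find_bad_byte(data, encoding="utf-8", context=20):
    """Return (offset, nearby bytes) for the first undecodable byte, or None.

    find_bad_byte is a hypothetical helper written for illustration only.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the codec rejected
        lo = max(exc.start - context, 0)
        return exc.start, data[lo:exc.start + context]
    return None


# Stand-in for the real file contents: valid UTF-8 plus one stray 0x92 byte
sample = b"caf\xc3\xa9 isn\x92t broken"
print(find_bad_byte(sample))
```

With the real file you would pass in the data read in 'rb' mode above. The returned slice shows the text around the bad byte, which is usually enough to guess the actual encoding.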

In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):

# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)

will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
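
To see the difference between the two handlers without touching the file, you can decode a small byte string directly. This is only a sketch; the sample bytes are made up for illustration.

```python
# Made-up sample: ASCII text with one stray 0x92 byte, which is not valid UTF-8
raw = b"isn\x92t"

ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

With errors='replace' the U+FFFD marker stays in the text, so you can search for it afterwards to see how much data was affected.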

In this case, 0x92 is the cp1252 encoding of ’ (the right single quotation mark), so your file is likely cp1252 and ignoring errors is the wrong solution. You should just use the proper encoding, changing the with statement to:

with io.open('compounds.dat', encoding='cp1252') as f:
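
A quick way to sanity-check the cp1252 guess is to decode the offending byte on its own. Again, this is a sketch using a made-up sample rather than your actual data.

```python
# Made-up sample containing the byte from the error message
raw = b"isn\x92t"

# cp1252 maps 0x92 to U+2019, the right single quotation mark
text = raw.decode("cp1252")
print(text)
```

If the characters around the bad byte in your file read naturally after decoding this way, cp1252 is very likely the right encoding.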