Search code examples
pythonemailcharacter-encodinginvalid-characters

Is there a Python library function which attempts to guess the character-encoding of some bytes?


I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get out a unicode string approximating the original one without throwing a UnicodeDecodeError.

So, I'm looking for a function that takes a str and optionally some hints and does its darndest to give me back a unicode. I could write one of course, but if such a function exists its author has probably thought a bit deeper about the best way to go about this.

I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".


Solution

  • As far as I can tell, the standard library doesn't have a function, though it's not too difficult to write one as suggested above. I think the real thing I was looking for was a way to decode a string and guarantee that it wouldn't throw an exception. The errors parameter to string.decode does that.

    def decode(s, encodings=('ascii', 'utf8', 'latin1')):
        for encoding in encodings:
            try:
                return s.decode(encoding)
            except UnicodeDecodeError:
                pass
        return s.decode('ascii', 'ignore')