Search code examples
pythonunicodeutf-8python-2.x

Python: what does "...".encode("utf8") fix?


I wanted to url encode a python string and got exceptions with hebrew strings. I couldn't fix it and started doing some guess oriented programming. Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.

Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).


Solution

  • You original string was a unicode object containing raw Unicode code points, after encoding it as UTF-8 it is a normal byte string that contains UTF-8 encoded data.

    The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.