
How do I convert a string into bytes in Python?


In my code, I encode a string as UTF-8, convert the resulting bytes to a string, and send that to my other program. The other program receives the string, but when I try to decode it I get an error: AttributeError: 'str' object has no attribute 'decode'. I need to send the encoded data as a string because my other program receives it as JSON. My first program is in Python 3, and the other program is in Python 2.

# my first program
x = u"宇宙"
x = str(x.encode('utf-8'))


# my other program
text = x.decode('utf-8')
print(text)

What should I do to convert the string received by the second program to bytes so the decode works?


Solution

  • The most important piece of information for answering this properly is how you pass these objects to the Python 2 program: you are using JSON.

    So, stay with me:

    After the .encode step in program 1, you have a bytes object. Calling str(...) on it does not decode it: it just puts an escaping layer on the bytes object and turns it back into a string, one that contains the text of the bytes literal itself (the b'...' prefix, quotes and \x escapes included). When this string is written as is to a file, or transmitted over the network, it is encoded again. Non-ASCII characters are usually escaped with the \u prefix and the code point of each character, but here the original Chinese characters have already been encoded as UTF-8 and escaped, so they end up doubly escaped.

    Python's JSON load methods already decode the contents of JSON data into text strings, so no decode call should be needed at all.

    In short: to pass data around, simply encode your original text as JSON in the first program, and do not bother with any decoding after json.load on the target Python 2 program:

    # my first program
    import json

    x = "宇宙"
    # No str-encode-decode dance needed here.
    ...
    data = json.dumps({"example_key": x})  # add any other keys you need
    # code to transmit the JSON string by network or file as it is...


    # my other program
    import json

    text = json.loads(data)["example_key"]
    # text is a Unicode text string ready to be used!
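
    The sketch above leaves the transport out. As a rough, self-contained illustration (run here in a single Python 3 process, with "example_key" kept as a placeholder name and the transport step simply skipped), the round trip looks like this. It also shows the \u escaping mentioned earlier: with the default ensure_ascii=True, json.dumps produces a pure-ASCII string.

```python
import json

# Sender side (Python 3): serialize the text directly, no manual .encode().
x = "宇宙"
payload = json.dumps({"example_key": x})  # "example_key" is a placeholder name

# With the default ensure_ascii=True the payload is pure ASCII:
# every non-ASCII character is written as a \uXXXX escape.
print(payload)  # {"example_key": "\u5b87\u5b99"}

# Receiver side: json.loads undoes the escaping and returns text.
text = json.loads(payload)["example_key"]
print(text == x)  # True
```

    On the Python 2 side, json.loads returns the value as a unicode object, so there is nothing left to decode.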
    

    With your current approach, you are probably getting the text doubly encoded. I will mimic it on the Python 3 console, printing the result of each step so you can understand the transformations that are taking place.

    In [1]: import json
    
    In [2]: x = "宇宙"
    
    In [3]: print(x.encode("utf-8"))
    b'\xe5\xae\x87\xe5\xae\x99'
    
    In [4]: text = str(x.encode("utf-8"))
    
    In [5]: print(text)
    b'\xe5\xae\x87\xe5\xae\x99'
    
    In [6]: json_data = json.dumps(text)
    
    In [7]: print(json_data)
    "b'\\xe5\\xae\\x87\\xe5\\xae\\x99'"
    # as you can see, it is doubly escaped, and it is mostly useless in this form
    
    In [8]: recovered_from_json = json.loads(json_data)
    
    In [9]: print(recovered_from_json)
    b'\xe5\xae\x87\xe5\xae\x99'
    
    In [10]: print(repr(recovered_from_json))
    "b'\\xe5\\xae\\x87\\xe5\\xae\\x99'"
    
    In [11]: # and if you have data like this in files/databases you need to recover:
    
    In [12]: import ast
    
    In [13]: recovered_text = ast.literal_eval(recovered_from_json).decode("utf-8")
    
    In [14]: print(recovered_text)
    宇宙
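
    If many stored values were damaged this way, the recovery step above could be wrapped in a small helper. This is only a sketch: recover_double_encoded is a hypothetical name, and it assumes the damaged values are exactly the str() of UTF-8 encoded bytes, as in the question. Anything that does not look like a bytes literal is returned unchanged.

```python
import ast


def recover_double_encoded(value):
    """Turn the str() of a UTF-8 bytes object back into the original text.

    Hypothetical helper, not part of any library. Values that do not look
    like the repr of a bytes object are returned unchanged.
    """
    if not (value.startswith(("b'", 'b"')) and value[-1] in "'\""):
        return value
    try:
        raw = ast.literal_eval(value)  # parses the literal without running code
    except (ValueError, SyntaxError):
        return value
    if not isinstance(raw, bytes):
        return value
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return raw.decode("utf-8")


damaged = str("宇宙".encode("utf-8"))  # the mistake from the question
print(recover_double_encoded(damaged))       # 宇宙
print(recover_double_encoded("plain text"))  # plain text
```

    ast.literal_eval is used rather than eval because it only accepts Python literals, which matters when the strings come from files or a database.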