I've got a set filled with values taken from a JSON document. When I print my set I get the following output:
set(['Path\xc3\xa9', 'Synergy Cin\xc3\xa9ma'])
but if I print each element using a for loop I get the following output:
Pathé
Synergy Cinéma
Why don't I get the same encoding in both cases?
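I guess you are using Python 2, where this comes from the difference between the __repr__ and __str__ methods of the object. The values stored in your set are byte strings holding the UTF-8 encoded text. When you print the set, each element is shown through __repr__, which escapes every non-ASCII byte (hence \xc3\xa9). When you print a single element, print uses __str__ and writes the raw bytes to the output, and your terminal (presumably configured for UTF-8) decodes them and displays é.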
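The two views can be reproduced with a short sketch. It is written for Python 3, where byte strings need the b prefix, and the variable names are illustrative rather than taken from the question. A list is used instead of a set only to keep the output order fixed.

```python
# Byte strings holding UTF-8 encoded text, as in the question (Python 3 syntax).
names = [b'Path\xc3\xa9', b'Synergy Cin\xc3\xa9ma']

# Printing a container shows repr() of each element,
# and repr() escapes the non-ASCII bytes.
container_view = repr(names)

# Decoding each element turns the bytes into readable text.
element_view = [name.decode('utf-8') for name in names]

print(container_view)
for text in element_view:
    print(text)
```

The first print shows the escaped bytes, while the loop shows the accented characters, which mirrors the difference seen in the question.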
You can check the interpreter's default encoding with the function sys.getdefaultencoding().
Note that in Python 3 the default encoding is utf-8 (i.e. by default "any string created (...) is stored as Unicode", according to the documentation), so you won't get exactly the same behavior. In the Python 2 snippet the two hashes are equal because 'Pathé' and b'Path\xc3\xa9' are the same byte string (Python sets are based on hash values). In the Python 3 snippet the text string and the bytes object are distinct values, and here their hashes differ:
Python 2:
>>> a = b'Path\xc3\xa9'
>>> a
'Path\xc3\xa9'
>>> print(a)
Pathé
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> hash('Pathé')
8776754739882320435
>>> hash(b'Path\xc3\xa9')
8776754739882320435
Python 3:
>>> a = b'Path\xc3\xa9'
>>> a
b'Path\xc3\xa9'
>>> print(a)
b'Path\xc3\xa9'
>>> print(a.decode())
Pathé
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> hash("Pathé")
1530394699459763000
>>> hash(b"Path\xc3\xa9")
1621747577200686773