Why does set() in Python keep only the first of two equal elements when they have different string types (ASCII str vs Unicode)? For example:
list1, list2 = [u'string' , 'string'], ['string', u'string']
set1, set2 = set(list1), set(list2)
When I print set1 and set2, they produce different output:
print(set1)
set([u'string'])
print(set2)
set(['string'])
Unicode and regular strings with the same ASCII contents get the same hash and are considered equal:
>>> hash(u'string')
-9167918882415130555
>>> hash('string')
-9167918882415130555
>>> u'string' == 'string'
True
Putting two 'equal' objects into a set results in just one object remaining; which one is kept depends only on the order in which you add your strings.
In CPython, the first object wins: in one of your samples u'string' is added first, so adding 'string' to the same set has no effect; in the other sample 'string' comes first, so adding u'string' has no effect.
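You can see which object survived by checking the type of the remaining element; a short Python 2 sketch (the types shown are what I would expect on CPython 2):

>>> s = set([u'string', 'string'])   # unicode object added first
>>> type(s.pop())
<type 'unicode'>
>>> s = set(['string', u'string'])   # str object added first
>>> type(s.pop())
<type 'str'>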
This only applies if the str object can be decoded as ASCII. With any data outside the ASCII range the above no longer holds true; you even get a specific warning when you try to test for equality anyway:
>>> 'stringå' == u'stringå'
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
>>> 'stringå' in set([u'stringå'])
False
>>> set([u'stringå', 'stringå'])
set([u'string\xe5', 'string\xc3\xa5'])
My terminal happens to be set to UTF-8, so entering å into an interactive session really ends up as the UTF-8 encoded byte sequence C3 A5. That sequence is not decodable as ASCII, so the comparison fails, the str and unicode versions no longer test as equal, and they show up as separate objects in the set. The Python interpreter auto-decoded u'stringå' from UTF-8 to form the unicode object.
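If you decode the str bytes to unicode explicitly before comparing, equality (and set deduplication) works again; a minimal sketch, assuming the same UTF-8 terminal as above:

>>> 'string\xc3\xa5'.decode('utf-8') == u'stringå'   # explicit decode restores equality
True
>>> set([u'stringå', 'string\xc3\xa5'.decode('utf-8')])
set([u'string\xe5'])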