
Why does set() in Python keep the first element from a list of strings with different character encodings?


Why does set() in Python keep only the first element when a list contains the same string in different encodings (ASCII, Unicode)? For example:

list1, list2 = [u'string' , 'string'], ['string', u'string']
set1, set2 = set(list1), set(list2)

When I print set1 and set2, the outputs differ:

print(set1)
set([u'string'])

print(set2)
set(['string'])

Solution

  • Unicode and regular strings with the same ASCII contents get the same hash and are considered equal:

    >>> hash(u'string')
    -9167918882415130555
    >>> hash('string')
    -9167918882415130555
    >>> u'string' == 'string'
    True
    

    Putting two 'equal' objects into a set results in just one object remaining. It then only matters in what order you put in your strings.

    In CPython, the first object wins; in your samples, one puts u'string' first, so adding 'string' to the same set has no effect, and in the other sample 'string' is first so adding u'string' has no effect.

    This only applies if the str object can be decoded as ASCII. For any data outside the ASCII range the above no longer holds true; you even get a specific warning when you try to test for equality anyway:

    >>> 'stringå' == u'stringå'
    __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    False
    >>> 'stringå' in set([u'stringå'])
    False
    >>> set([u'stringå', 'stringå'])
    set([u'string\xe5', 'string\xc3\xa5'])
    

    My terminal happens to be set to UTF-8, so entering å into an interactive session really ends up as the UTF-8 encoded byte sequence C3 A5. This is not decodable as ASCII, so the comparison fails: the str and unicode versions no longer test as equal and show up as separate objects in a set. The Python interpreter auto-decoded u'stringå' from UTF-8 to form the unicode object.
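Both effects can be reproduced in a short script. This is only a sketch: it is written to run on Python 2.7 and Python 3 alike, so the first part uses 1 and 1.0 as a stand-in for u'string' and 'string' (an assumption made here because the two literals are the same type on Python 3; like the strings above, 1 and 1.0 hash alike and compare equal). All variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch only; variable names are illustrative, not from the original answer.

# Part 1: equal objects with equal hashes collapse to one set member,
# and on CPython the object inserted first is the one that stays.
# 1 and 1.0 are stand-ins for u'string' and 'string' (assumption).
assert hash(1) == hash(1.0) and 1 == 1.0
int_first = set([1, 1.0])
float_first = set([1.0, 1])
kept_types = (type(list(int_first)[0]).__name__,
              type(list(float_first)[0]).__name__)
print(kept_types)

# Part 2: a non-ASCII character breaks the ASCII assumption.
text = u'string\xe5'          # u'stringå' as a unicode object
raw = text.encode('utf-8')    # the bytes a UTF-8 terminal would send

try:
    raw.decode('ascii')
    ascii_decodable = True
except UnicodeDecodeError:
    # \xc3\xa5 is outside the ASCII range, so implicit ASCII decoding fails.
    ascii_decodable = False
print(ascii_decodable)

# Decoding explicitly with the right codec restores equality, so the two
# values collapse into a single set member again.
merged = set([text, raw.decode('utf-8')])
print(len(merged))
```

Decoding byte strings explicitly before they go into the set avoids depending on the implicit ASCII conversion, and with it on insertion order.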