Search code examples
pythonpython-3.xpython-2.7setnon-deterministic

Indeterministic sets in Python 2 and 3


Python 2

Sets are collections of unordered values. If I construct a set via a set literal, e.g.

s = {'a', 'b', 'c'}

and then print it, I get the elements in some scrambled order. However, it seems that in Python 2.7, the above example always results in the same ordering:

print(s)  # set(['a', 'c', 'b']) in Python 2.7

How does Python 2.7 decide on this ordering? Even the hashes of 'a', 'b' and 'c' are not in the order produced.

Python 3

In Python 3.x (including 3.6 where dict keys are ordered) the resulting order seems to be random, though always the same within a given Python process. That is, repeatedly re-building the set literal always lead to the same ordering, as long as I do not restart the Python interpreter.

To check the ordering across multiple Python processes, consider the bash code

(for _ in {1..50}; do python3 -c "s = {'a', 'b', 'c'}; print(s)"; done) | sort -u

This will (most often) show the 6 different ways the 3 elements can be arranged. Switching out python3 with python(2), we only see the ordering ['a', 'c', 'b']. What determines the ordering in Python 3?

I see that the hash value of objects are deterministic in Python 2 while random (though constant within a Python process) in Python 3. I'm sure this is key to the full explanation.

Edit

As deceze writes in his comment, I would like to know if Python explicitly does something just to achieve this randomization, or if it happens "for free".


Solution

  • The reason for the difference in Python 3 (from Python 3.3 onwards) is that hash randomization is enabled by default, you could turn this off by setting the PYTHONHASHSEED environmental variable to a fixed value:

    $ export PYTHONHASHSEED=0
    $ (for _ in {1..50}; do python3  -c "s = {'a', 'b', 'c'}; print(s)"; done) | sort -u
    {'a', 'b', 'c'}
    

    Equally you can turn hash randomization on in Python 2 with the -R flag:

    $ (for _ in {1..50}; do python2 -R -c "s = {'a', 'b', 'c'}; print(s)"; done) | sort -u
    set(['a', 'b', 'c'])
    set(['a', 'c', 'b'])
    set(['b', 'c', 'a'])
    set(['c', 'b', 'a'])
    

    Note, you don't generally want to turn it off since having hash randomization enabled helps protect against certain denial-of-service attacks.