Search code examples
pythonhashsetcpythonpython-internals

Why does tuple(set([1,"a","b","c","z","f"])) == tuple(set(["a","b","c","z","f",1])) 85% of the time with hash randomization enabled?


Given Zero Piraeus' answer to another question, we have that

x = tuple(set([1, "a", "b", "c", "z", "f"]))
y = tuple(set(["a", "b", "c", "z", "f", 1]))
print(x == y)

Prints True about 85% of the time with hash randomization enabled. Why 85%?


Solution

  • I'm going to assume any readers of this question to have read both:

    The first thing to note is that hash randomization is decided on interpreter start-up.

    The hash of each letter will be the same for both sets, so the only thing that can matter is if there is a collision (where order will be affected).


    By the deductions of that second link we know the backing array for these sets starts at length 8:

    _ _ _ _ _ _ _ _
    

    In the first case, we insert 1:

    _ 1 _ _ _ _ _ _
    

    and then insert the rest:

    α 1 ? ? ? ? ? ?
    

    Then it is rehashed to size 32:

        1 can't collide with α as α is an even hash
      ↓ so 1 is inserted at slot 1 first
    ? 1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
    

    In the second case, we insert the rest:

    ? β ? ? ? ? ? ?
    

    And then try to insert 1:

        Try to insert 1 here, but will
      ↓ be rehashed if β exists
    ? β ? ? ? ? ? ?
    

    And then it will be rehashed:

        Try to insert 1 here, but will
        be rehashed if β exists and has
      ↓ not rehashed somewhere else
    ? β ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
    

    So whether the iteration orders are different depends solely on whether β exists.


    The chance of a β is the chance that any of the 5 letters will hash to 1 modulo 8 and hash to 1 modulo 32.

    Since anything that hashes to 1 modulo 32 also hashes to 1 modulo 8, we want to find the chance that of the 32 slots, one of the five is in slot 1:

    5 (number of letters) / 32 (number of slots)
    

    5/32 is 0.15625, so there is a 15.625% chance¹ of the orders being different between the two set constructions.


    Not very strangely at all, this is exactly what Zero Piraeus measured.


    ¹Technically even this isn't obvious. We can pretend every one of the 5 hashes uniquely because of rehashing, but because of linear probing it's actually more likely for "bunched" structures to occur... but because we're only looking at whether a single slot is occupied, this doesn't actually affect us.