python list if-statement dictionary cpython

`object in list` behaves different from `object in dict`?

I've got an iterator with some objects in it and I wanted to create a collection of uniqueUsers in which I only list every user once. So playing around a bit I tried it with both a list and a dict:

>>> for m in ms: print m.to_user  # let's first look what's inside ms
...
Pete Kramer
Pete Kramer
Pete Kramer
>>> 
>>> uniqueUsers = []  # Create an empty list
>>> for m in ms:
...     if m.to_user not in uniqueUsers:
...         uniqueUsers.append(m.to_user)
...
>>> uniqueUsers
[Pete Kramer]  # This is what I would expect
>>> 
>>> uniqueUsers = {}  # Now let's create a dict
>>> for m in ms:
...     if m.to_user not in uniqueUsers:
...         uniqueUsers[m.to_user] = 1
...
>>> uniqueUsers
{Pete Kramer: 1, Pete Kramer: 1, Pete Kramer: 1}

So I tested it by converting the dict to a list when doing the if statement, and that works as I would expect it to:

>>> uniqueUsers = {}
>>> for m in ms:
...     if m.to_user not in list(uniqueUsers):
...         uniqueUsers[m.to_user] = 1
...
>>> uniqueUsers
{Pete Kramer: 1}

and I can get a similar result by testing against uniqueUsers.keys().

The thing is that I don't understand why this difference occurs. I always thought that if you do if object in dict, it simply creates a list of the dicts keys and tests agains that, but that's obviously not the case.

Can anybody explain how object in dict internally works and why it doesn't behave similar to object in list (as I would expect it to)?

Solution

In order to understand what’s going on, you have to understand how the in operator, the membership test, behaves for the different types.

For lists, this is pretty simple due to what lists fundamentally are: Ordered arrays that do not care about duplicates. The only possible way to peform a membership test here is to iterate over the list and check every item on equality. Something like this:

# x in lst
for item in lst:
    if x == item:
        return True
return False

Dictionaries are a bit different: They are hash tables were keys are meant to be unique. Hash tables require the keys to be hashable which essentially means that there needs to be an explicit function that converts the object into an integer. This hash value is then used to put the key/value mapping somewhere into the hash table.

Since the hash value determines where in the hash table an item is placed, it’s critical that objects which are meant to be identical produce the same hash value. So the following implication has to be true: x == y => hash(x) == hash(y). The reverse does not need to be true though; it’s perfectly valid to have different objects produce the same hash value.

When a membership test on a dictionary is performed, then the dictionary will first look for the hash value. If it can find it, then it will perform an equality check on all items it found; if it didn’t find the hash value, then it assumes that it’s a different object:

# x in dct
h = hash(x)
items = getItemsForHash(dct, h)
for item in items:
    if x == item:
        return True
# items is empty, or no match inside the loop
return False

Since you get the desired result when using a membership test against a list, that means that your object implements the equality comparison (__eq__) correctly. But since you do not get the correct result when using a dictionary, there seems to be a __hash__ implementation that is out of sync with the equality comparison implementation:

>>> class SomeType:
        def __init__ (self, x):
            self.x = x
        def __eq__ (self, other):
            return self.x == other.x
        def __hash__ (self):
            # bad hash implementation
            return hash(id(self))

>>> l = [SomeType(1)]
>>> d = { SomeType(1): 'x' }
>>> x = SomeType(1)
>>> x in l
True
>>> x in d
False

Note that for new-style classes in Python 2 (classes that inherit from object), this “bad hash implementation” (which is based on the object id) is the default. So when you do not implement your own __hash__ function, it still uses that one. This ultimately means that unless your __eq__ only performs an identity check (the default), the hash function will be out of sync.

So the solution is to implement __hash__ in a way that it aligns with the rules used in __eq__. For example, if you compare two members self.x and self.y, then you should use a compound hash over those two members. The easiest way to do that is to return the hash value of a tuple of those values:

class SomeType (object):
    def __init__ (self, x, y):
        self.x = x
        self.y = y

    def __eq__ (self, other):
        return self.x == other.x and self.y == other.y

    def __hash__ (self):
        return hash((self.x, self.y))

Note that you should not make an object hashable if it is mutable:

If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).