Search code examples
pythonlistdictionarystringification

How to use a list of strings [which may contain any character] as keys?


Basically I "understand" the ambiguity with using other containers as keys. - do you compare the reference or do you compare the values in it (and how deep).

Well can't you just specialize a list and make a custom comparison/hash operator? As I know in my application I wish to compare the list-of-strings by the value of the strings (and relative ordering of course).

So how would I go about writing a custom hash for these kind of lists? Or in another way - how would I stringify the list without leading to the ambiguity that you get from introducing delimiters (that delimiter might be part of the string)?!

Regarding this question: https://wiki.python.org/moin/DictionaryKeys
There it is said directly that lists can't be used;

That said, the simple answer to why lists cannot be used as dictionary keys is that lists do not provide a valid hash method. Of course, the obvious question is, "Why not?"

So I am writing this to question if there's a way to make lists hashable; and how I would make a satisfactory Hash method.


As an example of why I would want this, see the code below:

namelist = sorted(["Jan", "Frank", "Kim"])
commonTrait = newTraits()
traitdict = {}
traitdict[namelist] = commonTrait;

//later I wish to retrieve it:
trait = traitdict[sorted(["Jan", "Frank", "Kim"])]

In this direct use I have in mind the "ordering of the list" doesn't really matter (sorting is just done in above code to make the lists always equal if they hold the same content).


Solution

  • If you need to use a collection of strings as a dictionary key, you have 2 obvious choices: if the order matters, use tuple:

    >>> d[tuple(['foo', 'bar'])] = 'baz'
    >>> d['foo', 'bar']
    baz
    >>> d['bar', 'foo']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: ('bar', 'foo')
    

    If the order must not matter, use frozenset:

    >>> d[frozenset(['foo', 'bar'])] = 'baz'
    >>> d[frozenset(['bar', 'foo'])]
    'baz'
    >>> d[frozenset(['bar', 'foo', 'bar'])]
    'baz'
    

    If the count matters but ordering does not, use sorted with a tuple:

    >>> d[tuple(sorted(['foo', 'bar']))] = 'baz'
    >>> d[tuple(sorted(['bar', 'foo']))]
    'baz'
    >>> d[tuple(sorted(['bar', 'foo', 'bar']))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: ('bar', 'bar', 'foo')
    

    Unlike with Perl hashes, or JavaScript object properties, you do not need to stringify dictionary keys in Python.


    Now, concerning the mutable list not being hashable: The Python dictionary implementation uses the hashtable structure. It specifically requires and assumes of the key that:

    1. the key implements the __hash__ method that returns an integer
    2. that the integers should be as widely spread in the output range, and uniformly mapped, as possible.
    3. that __hash__ method returns an unchanging number for the same object
    4. a == b implies that a.__hash__() == b.__hash__()

    A list is not usable as a dictionary key, as:

    >>> [].__hash__()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'NoneType' object is not callable
    

    The list class cannot provide a __hash__ method that could satisfy all the requirements simultaneously a == b needs to imply that a.__hash__() == b.__hash__().

    (It *could provide an implementation that returns 0 for each list, and it would work correctly then, but it would refute the use of hashing altogether, as all lists would be mapped to the same slot in the dictionary, as the hash codes would break the rule 2).

    It is also not possible to create a __hash__ method for a list:

    >>> [].__hash__ = lambda x: 0
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'list' object attribute '__hash__' is read-only
    

    Of course we can always see what would happen if list would have the __hash__ method - we create a subclass of list and provide __hash__ there; the obvious implementation for the hash code would be that of the tuple():

    >>> class hashablelist(list):
    ...     def __hash__(self):
    ...         return hash(tuple(self))
    ... 
    >>> x = hashablelist(['a', 'b', 'c'])
    >>> y = hashablelist(['a', 'b', 'd'])
    >>> d = {}
    >>> d[x] = 'foo'
    >>> d[y] = 'bar'
    >>> d.items()
    [(['a', 'b', 'c'], 'foo'), (['a', 'b', 'd'], 'bar')]
    >>> y[2] = 'c'
    >>> d.items()
    [(['a', 'b', 'c'], 'foo'), (['a', 'b', 'c'], 'bar')]
    >>> del d[x]
    >>> del d[y]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: ['a', 'b', 'c']
    >>> d.items()
    [(['a', 'b', 'c'], 'bar')]
    >>> x in d
    False
    >>> y in d
    False
    >>> x in d.keys()
    True
    >>> y in d.keys()
    True
    

    The code shows that we just managed to get a broken dictionary as a result - there is no way of accessing or removing the ['a', 'b', 'c'] -> 'bar' pair directly by the key, even though it is visible in .keys(), .values() and .items().