Search code examples
pythondictionarymappingkey

Is there a way to provide sets or "order-insensitive tuples" as keys for a dict (or equivalent datstructure) in Python?


I'm looking for a pythonic way to map a given value to a tuple of keys in a data structure, so that datastructure[key1, key2] returns the same value as datastructure[key2, key1], without having to save the value twice for each mutation of the key tuple. Ideally, I'm looking for something like a dictionary with sets or "order-insensitive tuples" (I know, that tuples are by definition ordered) as keys.

My usecase for this is: I need to save similarity values of pairs of strings in an efficient way. Since sim(word1, word2) equals sim(word2, word1) (so ordering doesn't matter), I don't want to save the values for each tuple, because it would be redundant. Instead, I'd like to save one value for each word pair (the tuple key) and access it by either ordering of the words that the pair is made up.

Currently, I'm doing it like this and it's working fine:

d = {tuple(sorted(['w1', 'w2'])): 4}

print(d[tuple(sorted(['w1', 'w2']))])
print(d[tuple(sorted(['w2', 'w1']))])

>> 4
>> 4

But each update and access to the dictionary now requires a list initiation, sort action, and tuple transformation.

My question is therefore: Is there an elegant way to do, what I want, without having to use the above workaround or defining a custom class as key? Maybe there is a datastructure that allows exactly that and may even be faster because it is implemented on a lower level than my workaround. Ideally this would look something like this:

d = {('w1', 'w2'): 4}

print(d[('w1','w2')])
print(d[('w2','w1')])

>> 4
>> 4

Note: The above example doesn't work, because ('a2','b2') != ('b2','a2').

Here is the research I've done so far and an explanation of why these solutions aren't exactly what I'm looking for:

I'm thankful for any hints and make excuses in advance, if a duplicate Stackoverflow question exists and I overlooked it - it's hard to search for this, since I don't know how the desired feature would be called.


Solution

  • You could use a helper function that creates sorted tuples in a cheaper way:

    def key(s, t):
        return (s, t) if s < t else (t, s)
    
    d = {key('w1', 'w2'): 4}
    
    print(d[key('w1','w2')])
    print(d[key('w2','w1')])
    

    The function call overhead makes it slower than inlining the expression would be, but it's still faster than the alternatives mentioned so far.

    Here are various expressions and their times for evaluating them and using them to look up the dict value:

     132.7 ± 1.8 ns  (s, t) if s < t else (t, s)
     171.7 ± 1.5 ns  key(s, t)
     236.5 ± 0.9 ns  frozenset({s, t})
     254.9 ± 2.5 ns  frozenset((s, t))
     273.3 ± 2.0 ns  frozenset([s, t])
     343.2 ± 3.6 ns  tuple(sorted([s, t]))
     767.0 ± 9.7 ns  T(s, t)
    

    Benchmark code (Attempt This Online!):

    from timeit import timeit
    from statistics import mean, stdev
    
    setup = '''
    s, t = 'w1', 'w2'
    sts = [(s, t), (t, s)] * 5000
    
    def key(s, t):
        return (s, t) if s < t else (t, s)
    
    # from Anentropic's answer
    class T(tuple):
        def __new__(cls, *args):
            return super().__new__(cls, tuple(sorted(args)))
    
    d = {KEY: 4}
    '''
    
    funcs = [
        "tuple(sorted([s, t]))",
        "frozenset([s, t])",
        "frozenset((s, t))",
        "frozenset({s, t})",
        "key(s, t)",
        "(s, t) if s < t else (t, s)",
        "T(s, t)",
    ]
    
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e9 for t in sorted(times[f])[:10]]
        return f'{mean(ts):6.1f} ± {stdev(ts):3.1f} ns '
    
    for _ in range(100):
        for f in funcs:
            code = f'for s, t in sts: d[{f}]'
            setup_ = setup.replace('KEY', f)
            t = timeit(code, setup_, number=10**0) / 1e4
            times[f].append(t)
    
    for f in sorted(funcs, key=stats):
        print(stats(f), f)