Tags: python, performance, dictionary, lookup, startswith

More efficient way to look up dictionary values whose keys start with same prefix


I have a dictionary whose keys come in sets that share the same prefix, like this:

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

Given a query string, I need to look up all the values whose keys start with that string. For example, for query="key1" I need to get ["valA", "valB", "valC"].

My implementation below works but is too slow for a large number of queries since the dictionary d has about 30,000 keys and most of the keys are more than 20 characters long:

result = [d[s] for s in d.keys() if s.startswith(query)]

Is there a faster/more efficient way to implement this?


Solution

  • In Python 2.x, you can avoid building the intermediate list that dict.keys() returns by iterating over the dictionary directly:

    result = [d[key] for key in d if key.startswith(query)]
    

    But you most likely want to use a trie instead of a dictionary, so that you can find all the values whose keys share a common prefix without scanning every key (a trie is a tree in which each node represents a prefix of the stored keys).

    Several trie implementations are available for Python; two of them, PyTrie and datrie, are timed below.

    [Figure: a trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn". Source: Wikipedia.]
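
    To make the idea concrete, here is a minimal pure-Python sketch of a prefix lookup in a trie, written for Python 3. It is not how PyTrie or datrie are implemented; the names TrieNode, PrefixTrie and values_with_prefix are made up for this illustration. The lookup walks down one node per character of the query and then collects every value stored in the subtree below that node:

```python
class TrieNode:
    """One trie node: links to children plus an optional stored value."""
    __slots__ = ("children", "value", "has_value")

    def __init__(self):
        self.children = {}      # maps a single character to a child node
        self.value = None
        self.has_value = False  # distinguishes "no key ends here" from value None


class PrefixTrie:
    """Hypothetical helper class, for illustration only."""

    def __init__(self, mapping=None):
        self.root = TrieNode()
        for key, value in (mapping or {}).items():
            self.insert(key, value)

    def insert(self, key, value):
        node = self.root
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        node.value = value
        node.has_value = True

    def values_with_prefix(self, prefix):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no key starts with this prefix
        # Collect every value stored in that subtree (iterative depth-first walk).
        found = []
        stack = [node]
        while stack:
            current = stack.pop()
            if current.has_value:
                found.append(current.value)
            stack.extend(current.children.values())
        return found


d = {"key1": "valA", "key123": "valB", "key1XY": "valC",
     "key2": "valD", "key2-22": "valE"}
trie = PrefixTrie(d)

# The traversal order is not the insertion order, so sort for display.
result = sorted(trie.values_with_prefix("key1"))
print(result)  # ['valA', 'valB', 'valC']
```

    The cost of a lookup depends on the length of the query and the number of matching keys, not on the total number of keys. A pure-Python trie like this one carries a lot of per-node overhead, though, so a library implementation is likely the better choice in practice.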


    Let's compare the timings for the different solutions:

    # create a dictionary with 30k entries
    d = {str(x):str(x) for x in xrange(1, 30001)}
    query = '108'
    
    # dict with keys()
    %timeit [d[s] for s in d.keys() if s.startswith(query)]
    
        100 loops, best of 3: 8.87 ms per loop
    

    # dict without keys()
    %timeit [d[s] for s in d if s.startswith(query)]
    
        100 loops, best of 3: 7.83 ms per loop
    
    # 11.72% improvement
    

    # PyTrie (https://pypi.python.org/pypi/PyTrie/0.2)
    import pytrie
    pt = pytrie.Trie(d)
    
    %timeit [pt[s] for s in pt.iterkeys(query)]
    
        1000 loops, best of 3: 320 µs per loop
    
    # 96.39% improvement
    

    # datrie (https://pypi.python.org/pypi/datrie/0.7)
    import datrie
    dt = datrie.Trie('0123456789')
    for key, val in d.iteritems():
        dt[unicode(key)] = val
    
    %timeit [dt[s] for s in dt.keys(unicode(query))]
    
        10000 loops, best of 3: 162 µs per loop
    
    # 98.17% improvement
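
    The timings above were taken on Python 2 with IPython's %timeit. As a rough sketch for re-running the dictionary baselines on Python 3 with only the standard library, something like the following should work. The function names are made up for this example, and with_items is an extra variant that is not part of the measurements above. Note that in Python 3 d.keys() returns a view rather than a list, so the first two variants are expected to perform about the same there.

```python
import timeit

# Same test data as above, written for Python 3 (range instead of xrange).
d = {str(x): str(x) for x in range(1, 30001)}
query = "108"


def with_keys():
    return [d[s] for s in d.keys() if s.startswith(query)]


def without_keys():
    return [d[s] for s in d if s.startswith(query)]


def with_items():
    # Extra variant (an assumption, not measured above):
    # iterating over items() avoids the second lookup d[s].
    return [v for k, v in d.items() if k.startswith(query)]


for fn in (with_keys, without_keys, with_items):
    # Best of 3 repeats of 20 calls each, reported per call.
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

    Absolute numbers will differ from the ones quoted above depending on the machine and the Python version, so only compare results taken in the same run.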