Search code examples
pythonlistperformancesortedlist

Efficiently test if an item is in a sorted list of strings


Suppose I have a list of short lowercase [a-z] strings (max length 8):

L = ['cat', 'cod', 'dog', 'cab', ...]

How to efficiently determine if a string s is in this list?

I know I can do if s in L: but I could presort L and binary-tree search.

I could even build my own tree, letter by letter. So setting s='cat': So T[ ord(s[0])-ord('a') ] gives the subtree leading to 'cat' and 'cab', etc. But eek, messy!

I could also make my own hashfunc, as L is static.

def hash_(item):
    w = [127**i * (ord(j)-ord('0')) for i,j in enumerate(item)]
    return sum(w) % 123456

... and just fiddle the numbers until I don't get duplicates. Again, ugly.

Is there anything out-of-the-box I can use, or must I roll my own?

There are of course going to be solutions everywhere along the complexity/optimisation curve, so my apologies in advance if this question is too open ended.

I'm hunting for something that gives decent performance gain in exchange for a low LoC cost.


Solution

  • The builtin Python set is almost certainly going to be the most efficient device you can use. (Sure, you could roll out cute things such as a DAG of your "vocabulary", but this is going to be much, much slower).

    So, convert your list into a set (preferably built once if multiple tests are to be made) and test for membership:

    s in set(L)
    

    Or, for multiple tests:

    set_values = set(L)
    
    # ...
    if s in set_values:
        # ...
    

    Here is a simple example to illustrate the performance:

    from string import ascii_lowercase
    import random
    
    n = 1_000_000
    L = [''.join(random.choices(ascii_lowercase, k=6)) for _ in range(n)]
    

    Time to build the set:

    %timeit set(L)
    # 99.9 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Time to query against the set:

    set_values = set(L)
    
    # non-existent string
    %timeit 'foo' in set_values
    # 45.1 ns ± 0.0418 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    
    # existing value
    s = L[-1]
    
    a = %timeit -o s in set_values
    # 45 ns ± 0.0286 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    

    Contrast that to testing directly against the list:

    b = %timeit -o s in L
    # 16.5 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    b.average / a.average
    # 359141.74
    

    When's the last time you made a 350,000x speedup ;-) ?