Efficiently test if an item is in a sorted list of strings

Suppose I have a list of short lowercase [a-z] strings (max length 8):

L = ['cat', 'cod', 'dog', 'cab', ...]

How to efficiently determine if a string s is in this list?

I know I can do if s in L: but I could presort L and binary-tree search.

I could even build my own tree, letter by letter. So setting s='cat': So T[ ord(s[0])-ord('a') ] gives the subtree leading to 'cat' and 'cab', etc. But eek, messy!

I could also make my own hashfunc, as L is static.

def hash_(item):
    w = [127**i * (ord(j)-ord('0')) for i,j in enumerate(item)]
    return sum(w) % 123456

... and just fiddle the numbers until I don't get duplicates. Again, ugly.

Is there anything out-of-the-box I can use, or must I roll my own?

There are of course going to be solutions everywhere along the complexity/optimisation curve, so my apologies in advance if this question is too open ended.

I'm hunting for something that gives decent performance gain in exchange for a low LoC cost.

Solution

The builtin Python set is almost certainly going to be the most efficient device you can use. (Sure, you could roll out cute things such as a DAG of your "vocabulary", but this is going to be much, much slower).

So, convert your list into a set (preferably built once if multiple tests are to be made) and test for membership:

s in set(L)

Or, for multiple tests:

set_values = set(L)

# ...
if s in set_values:
    # ...

Here is a simple example to illustrate the performance:

from string import ascii_lowercase
import random

n = 1_000_000
L = [''.join(random.choices(ascii_lowercase, k=6)) for _ in range(n)]

Time to build the set:

%timeit set(L)
# 99.9 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Time to query against the set:

set_values = set(L)

# non-existent string
%timeit 'foo' in set_values
# 45.1 ns ± 0.0418 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# existing value
s = L[-1]

a = %timeit -o s in set_values
# 45 ns ± 0.0286 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Contrast that to testing directly against the list:

b = %timeit -o s in L
# 16.5 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

b.average / a.average
# 359141.74

When's the last time you made a 350,000x speedup ;-) ?