Suppose I have a list of short lowercase [a-z] strings (max length 8):
L = ['cat', 'cod', 'dog', 'cab', ...]
How to efficiently determine if a string s
is in this list?
I know I can do if s in L:
but I could presort L
and binary-tree search.
I could even build my own tree, letter by letter. So setting s='cat'
:
So T[ ord(s[0])-ord('a') ]
gives the subtree leading to 'cat'
and 'cab'
, etc. But eek, messy!
I could also make my own hashfunc, as L is static.
def hash_(item):
w = [127**i * (ord(j)-ord('0')) for i,j in enumerate(item)]
return sum(w) % 123456
... and just fiddle the numbers until I don't get duplicates. Again, ugly.
Is there anything out-of-the-box I can use, or must I roll my own?
There are of course going to be solutions everywhere along the complexity/optimisation curve, so my apologies in advance if this question is too open ended.
I'm hunting for something that gives decent performance gain in exchange for a low LoC cost.
The builtin Python set
is almost certainly going to be the most efficient device you can use. (Sure, you could roll out cute things such as a DAG of your "vocabulary", but this is going to be much, much slower).
So, convert your list
into a set
(preferably built once if multiple tests are to be made) and test for membership:
s in set(L)
Or, for multiple tests:
set_values = set(L)
# ...
if s in set_values:
# ...
Here is a simple example to illustrate the performance:
from string import ascii_lowercase
import random
n = 1_000_000
L = [''.join(random.choices(ascii_lowercase, k=6)) for _ in range(n)]
Time to build the set:
%timeit set(L)
# 99.9 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Time to query against the set:
set_values = set(L)
# non-existent string
%timeit 'foo' in set_values
# 45.1 ns ± 0.0418 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# existing value
s = L[-1]
a = %timeit -o s in set_values
# 45 ns ± 0.0286 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Contrast that to testing directly against the list
:
b = %timeit -o s in L
# 16.5 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
b.average / a.average
# 359141.74
When's the last time you made a 350,000x speedup ;-) ?