I want to know how to do a multi-index lookup for a sequence containing a large number (thousands and up) of elements, where two levels of indexes are used: one of the indexes is the elements themselves, and the other index is the same in each lookup.
Before I explain anything further, I would like to clarify that this is not homework or job-related; I am an unemployed programming enthusiast, and I only state this because some people might assume it is homework.
The elements can be of any hashable type, not just ints; they can be strs, for instance, and they might not be present in the lookup table. The second index is an int and is the same in each batch lookup. The desired behavior is: if the element is in the lookup table, output the element located at the second index of the collection associated with that element in the lookup table; otherwise, output the element itself.
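In the abstract, the operation I want looks like the following sketch (batch_lookup, batch, table, and j are placeholder names I am using for illustration, not part of any real API):
def batch_lookup(batch, table, j):
    # for each element: its collection's j-th entry if it is a known key, else the element itself
    return [table[e][j] if e in table else e for e in batch]
In the MRE below, batch is the input string, table is ROT, and j is the shift d.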
I want to know how to scale it up so it can handle gigantic inputs.
To illustrate the point, I have written a script that rotates the basic Latin letters in a string by n (an integer, 0 <= n <= 25): if a character in the string is one of the 52 Latin letters, it is replaced by the letter located n positions further along the alphabet, with wrap-around and case preserved; otherwise, the character is left as is.
The code is only a minimal reproducible example, not my actual goal. I have measured the execution time of the statements and provided the timings in comments:
import random
from typing import Mapping, Sequence
from itertools import product
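# LETTERS[i] maps each of the 52 letters to the letter i positions ahead, same case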
LETTERS = [
    {chr(a + b): chr(a + (b + i) % 26) for a, b in product((65, 97), range(26))}
    for i in range(26)
]
# 365 µs ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
UPPER = [chr(65 + i) for i in range(26)]
# 2.43 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
LOWER = [chr(97 + i) for i in range(26)]
# 2.68 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
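# ROT[c] is c's alphabet rotated to start at c (as a list), so ROT[c][d] is the letter d positions after c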
ROT = {c: case[i:] + case[:i] for case in (LOWER, UPPER) for i, c in enumerate(case)}
# 24.9 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
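# Rebuild a LETTERS-like structure from ROT: ROT_LIST[d][c] == ROT[c][d] == LETTERS[d][c]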
ROT_LIST = [{} for _ in range(26)]
# 1.58 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
for k, v in ROT.items():
    for i, c in enumerate(v):
        ROT_LIST[i][k] = c
# 135 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# (2.43 + 2.68 + 24.9 + 1.58) == 31.59
# (31.59 + 135) == 166.59
# 31.59 / 365 == 0.08654794520547945
# 166.59 / 365 == 0.4564109589041096
def rot(s: str, d: int = 13) -> str:
    # one dict lookup per character, with the character itself as the default
    return "".join(LETTERS[d].get(c, c) for c in s)
def rot_v1(s: str, d: int = 13) -> str:
    # fetch the rotation list once, then index into it if the key was present
    return "".join(c if (r := ROT.get(c, c)) == c else r[d] for c in s)
def rot_v2(s: str, d: int = 13) -> str:
    # membership test first, then the two-level lookup
    return "".join(ROT[c][d] if c in ROT else c for c in s)
def get_size(obj: object) -> int:
    # rough recursive estimate of the memory footprint of nested containers
    size = obj.__sizeof__()
    if isinstance(obj, Mapping):
        size += sum(get_size(k) + get_size(v) for k, v in obj.items())
    elif isinstance(obj, Sequence) and not isinstance(obj, str):
        size += sum(get_size(e) for e in obj)
    return size
# get_size(LETTERS) == 194152
# get_size(ROT) == 85352
# 85352 / 194152 == 0.439614322798632
string = "".join(random.choices(LOWER, k=4096))  # random.choices returns a list, so join it into a str
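For reference, with the default shift of 13 these functions implement ROT13; a quick check:
print(rot("Hello, World!"))  # Uryyb, Jbeyq!
print(rot("Hello, World!") == rot_v1("Hello, World!") == rot_v2("Hello, World!"))  # True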
LETTERS is the most straightforward way I can think of to generate the lookup table, but it is both memory-inefficient and time-inefficient. ROT is both time-efficient and memory-efficient: it reduces memory usage by 56.04% and execution time by 91.35%, though I guess querying it might be slower than LETTERS.
I have also found that generating a structure like LETTERS from ROT takes 54.36% less time than computing it directly.
I have performed multiple tests and the timings vary widely, but it seems that rot_v2 is consistently faster than rot, which in turn is consistently faster than rot_v1:
In [9]: %timeit rot(string)
742 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: %timeit rot_v1(string)
771 µs ± 8.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [11]: %timeit rot_v2(string)
569 µs ± 6.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [12]: %timeit rot_v2(string)
588 µs ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [13]: %timeit rot_v1(string)
821 µs ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [14]: %timeit rot_v2(string)
609 µs ± 53.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [15]: %timeit rot(string)
601 µs ± 50.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [16]: %timeit rot(string)
603 µs ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [17]: %timeit rot_v2(string)
586 µs ± 54.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [18]: %timeit rot_v1(string)
818 µs ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
I have considered using NumPy, but so far I have only used booleans and integers as array indices, and I don't know how to use an array of strings as indices.
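For single characters, one workaround might be to index by code point rather than by string; a minimal sketch, assuming the input is encodable as Latin-1 (make_table and rot_np are names I made up):
import numpy as np
def make_table(d: int) -> np.ndarray:
    # identity mapping for all 256 byte values, then overwrite the 52 letters
    table = np.arange(256, dtype=np.uint8)
    for base in (65, 97):  # ord("A"), ord("a")
        table[base:base + 26] = base + (np.arange(26) + d) % 26
    return table
def rot_np(s: str, d: int = 13) -> str:
    # view the string as an array of byte values and translate them all at once
    codes = np.frombuffer(s.encode("latin-1"), dtype=np.uint8)
    return make_table(d)[codes].tobytes().decode("latin-1")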
How can I speed it up?
Clarification: rot is not what I actually want; it is just an MRE of a more general multi-index lookup where some keys might be missing.
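That said, for a character-only lookup like the MRE (though not for general hashable keys), the standard library already offers str.translate, which leaves characters missing from the table unchanged, exactly the desired fallback. A sketch built on top of ROT, with TRANS and rot_translate as names of my own invention:
TRANS = [str.maketrans({c: v[d] for c, v in ROT.items()}) for d in range(26)]
def rot_translate(s: str, d: int = 13) -> str:
    # characters absent from the table pass through unchanged
    return s.translate(TRANS[d])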
Mapping dict.get, passing s twice so that each character is used as its own default and missing keys fall back to the character itself:
def rot_Kelly(s: str, d: int = 13) -> str:
    return "".join(map(LETTERS[d].get, s, s))
Benchmark results:
176.7 ± 0.8 μs rot_Kelly
450.2 ± 2.1 μs rot_v2
474.3 ± 1.7 μs rot
616.8 ± 3.4 μs rot_v1
Benchmark code (add it under all your code):
from timeit import timeit
from statistics import mean, stdev

def rot_Kelly(s: str, d: int = 13) -> str:
    return "".join(map(LETTERS[d].get, s, s))

funcs = rot, rot_v1, rot_v2, rot_Kelly
times = {f: [] for f in funcs}

def stats(f):
    # mean ± stdev of each function's 10 fastest samples, in microseconds
    ts = [t * 1e6 for t in sorted(times[f])[:10]]
    return f'{mean(ts):6.1f} ± {stdev(ts):3.1f} μs '

for _ in range(100):
    for f in funcs:
        t = timeit(lambda: f(string), number=10) / 10
        times[f].append(t)

for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__)
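Interleaving the functions across 100 rounds and keeping only the 10 fastest samples per function reduces the influence of background load on the results.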