Assume I have a generator `gen` that produces items, and another generator `trans` that transforms the items and returns one output item per input item. Assume both generators are expensive and I can't change either of them; both may take additional arguments. The output of `gen` is fed into `trans`, but when looping over the results of `trans`, I need the corresponding output of `gen` as well. My current solution is to `tee(gen())` and then `zip` that with the output of `trans`. This works well, but my question is: is there maybe a better solution that I am missing?
```python
from itertools import tee

# these two generators are just an example,
# assume these are expensive and can't be changed
def gen():
    yield from range(3)

def trans(inp):
    for x in inp:
        yield chr(x + ord("A"))

# my question is: is there a better way to achieve
# what the following two lines are doing?
g1, g2 = tee(gen())
for i, o in zip(g1, trans(g2)):
    print(f"{i} -> {o}")
```
That's a good use case for `tee`. The only downside I see is that it keeps up to the last 57 items alive, even when you consume the `tee` iterators in parallel. That's because `tee` is rather speed-optimized for inexpensive input iterators (see the discussion at the bottom of this answer).

If your items are large, that can become a memory problem, and the memory problem can in turn become a speed problem (exhausting your cache or even forcing swapping to disk). It could also be a problem if your objects are something like file handles that are kept open as long as they're alive.

You can avoid that by sending your input data through an extra generator that stores the current input item as a side effect. Two possible ways:
```python
for o in trans(i := x for x in gen()):
    print(f"{i} -> {o}")
```

```python
def side_store():
    global i
    for i in gen():
        yield i

for o in trans(side_store()):
    print(f"{i} -> {o}")
```
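If you'd rather not use `global`/`nonlocal`, a variation on the same side-effect idea is to pass a mutable one-element list as a cell. This is just a sketch of the same technique with different plumbing; the `cell` parameter is my addition, and I've repeated the question's example generators so it runs standalone:

```python
# the question's example generators, repeated so the sketch is self-contained
def gen():
    yield from range(3)

def trans(inp):
    for x in inp:
        yield chr(x + ord("A"))

def side_store(it, cell):
    """Yield items unchanged, keeping the most recent one in cell[0]."""
    for x in it:
        cell[0] = x
        yield x

cell = [None]
for o in trans(side_store(gen(), cell)):
    print(f"{cell[0]} -> {o}")  # 0 -> A, 1 -> B, 2 -> C
```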
Demo with an input generator yielding 1 MB objects:

```python
def gen():
    for _ in range(100):
        yield bytes(10**6)

def trans(inp):
    for x in inp:
        yield len(x)
```

Peak memory usage by your method and my two:

```none
58,003,398 bytes  original
 2,000,494 bytes  Stefan1
 2,000,678 bytes  Stefan2
```
Test script (Attempt This Online!):
```python
from itertools import tee
import tracemalloc as tm

def original(gen, trans, handle):
    g1, g2 = tee(gen())
    for i, o in zip(g1, trans(g2)):
        handle(i, o)

def Stefan1(gen, trans, handle):
    for o in trans(i := x for x in gen()):
        handle(i, o)

def Stefan2(gen, trans, handle):
    i = None
    def side_store():
        nonlocal i
        for i in gen():
            yield i
    for o in trans(side_store()):
        handle(i, o)

funcs = original, Stefan1, Stefan2

print('Memory-testing with big items:')

def gen():
    for _ in range(100):
        yield bytes(10**6)

def trans(inp):
    for x in inp:
        yield len(x)

def handle(i, o):
    pass

for f in funcs * 2:
    tm.start()
    f(gen, trans, handle)
    memory = tm.get_traced_memory()[1]
    tm.stop()
    print(f'{memory:10,} bytes ', f.__name__)

print('\nTesting with your example generators:\n')

def gen():
    yield from range(3)

def trans(inp):
    for x in inp:
        yield chr(x + ord("A"))

def handle(i, o):
    print(f"{i} -> {o}")

for f in funcs:
    print(f.__name__)
    f(gen, trans, handle)
    print()
```
The `tee` source code discussing its data links holding 57 items (it holds the items in a linked list of "links", each holding not just one but 57 items):

```c
/* The teedataobject pre-allocates space for LINKCELLS number of objects.
   To help the object fit neatly inside cache lines (space for 16 to 32
   pointers), the value should be a multiple of 16 minus space for
   the other structure members including PyHEAD overhead. The larger the
   value, the less memory overhead per object and the less time spent
   allocating/deallocating new links. The smaller the number, the less
   wasted space and the more rapid freeing of older data.
*/
#define LINKCELLS 57
```
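This buffering can be observed directly with weak references. The sketch below relies on a CPython implementation detail (links are only freed once both `tee` iterators have moved past them), so it's an illustration of current behavior, not a documented guarantee:

```python
import weakref
from itertools import tee

class Item:
    """Weakref-able stand-in for an expensive item."""

g1, g2 = tee(Item() for _ in range(100))
first = weakref.ref(next(g1))  # item 0
next(g2)                       # both iterators have now consumed item 0
for _ in range(56):            # consume items 1..56 from both iterators
    next(g1)
    next(g2)
assert first() is not None  # item 0 is still held by the first 57-item link
next(g1)                    # item 57: both iterators move on to the
next(g2)                    # next link, so the first link is freed ...
assert first() is None      # ... releasing item 0
```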