Search code examples
pythonpython-3.xgenerator

Keeping the input provided to a generator


Assume I have a generator gen that produces items, and another generator trans that transforms the items and returns one output item per input item, and assume that both generators are expensive and I can't change either of them. Both generators may have additional arguments. The output of gen is fed into trans, but when looping over the results of trans, I need the corresponding output of gen as well. My current solution is to tee(gen()) and then zip that with the output of trans, and this works well, but my question is whether there is maybe a better solution that I am missing?

from itertools import tee

# these two generators are just an example, assume these are expensive and can't be changed
def gen():
    yield from range(3)
def trans(inp):
    for x in inp:
        yield chr(x + ord("A"))

# my question is: is there a better way to achieve what the following two lines are doing?
g1, g2 = tee(gen())
for i, o in zip(g1, trans(g2)):
    print(f"{i} -> {o}")

Solution

  • That's a good use case for tee. The only downside I see is that it keeps the last up to 57 items in alive, even when you consume the tee iterators in parallel. That's because tee is rather speed-optimized, for inexpensive input iterators (see the discussion at the bottom of this answer).

    If your items are large, then that can become a memory problem, and the memory problem can also become a speed problem (exhausting your cache or even having to swap to hard drive). Or it could be a problem if your objects are something like file handles that are kept open as long as they're alive.

    You could avoid that by sending your input data through an extra generator that stores the current input item as a side effect. Two possible ways:

    for o in trans(i := x for x in gen()):
        print(f"{i} -> {o}")
    
    def side_store():
        global i
        for i in gen():
            yield i
    for o in trans(side_store()):
        print(f"{i} -> {o}")
    

    Demo with an input generator yielding 1 MB large objects:

    def gen():
        for _ in range(100):
            yield bytes(10**6)
    
    def trans(inp):
        for x in inp:
            yield len(x)
    

    Peak memory usage by your method and my two:

    58,003,398 bytes  original
     2,000,494 bytes  Stefan1
     2,000,678 bytes  Stefan2
    

    Test script (Attempt This Online!):

    from itertools import tee
    import tracemalloc as tm
    
    
    def original(gen, trans, handle):
        g1, g2 = tee(gen())
        for i, o in zip(g1, trans(g2)):
            handle(i, o)
    
    def Stefan1(gen, trans, handle):
        for o in trans(i := x for x in gen()):
            handle(i, o)
    
    def Stefan2(gen, trans, handle):
        i = None
        def side_store():
            nonlocal i
            for i in gen():
                yield i
        for o in trans(side_store()):
            handle(i, o)
    
    funcs = original, Stefan1, Stefan2
    
    
    print('Memory-testing with big items:')
    
    def gen():
        for _ in range(100):
            yield bytes(10**6)
    def trans(inp):
        for x in inp:
            yield len(x)
    def handle(i, o):
        pass
    for f in funcs * 2:
        tm.start()
        f(gen, trans, handle)
        memory = tm.get_traced_memory()[1]
        tm.stop()
        print(f'{memory:10,} bytes ', f.__name__)
    
    
    print('\nTesting with your example generators:\n')
    
    def gen():
        yield from range(3)
    def trans(inp):
        for x in inp:
            yield chr(x + ord("A"))
    def handle(i, o):
        print(f"{i} -> {o}")
    for f in funcs:
        print(f.__name__)
        f(gen, trans, handle)
        print()
    

    The tee source code discussing its data links holding 57 items (it holds the items in a linked list of "links" each holding not just one but 57 items):

    /* The teedataobject pre-allocates space for LINKCELLS number of objects.
       To help the object fit neatly inside cache lines (space for 16 to 32
       pointers), the value should be a multiple of 16 minus  space for
       the other structure members including PyHEAD overhead.  The larger the
       value, the less memory overhead per object and the less time spent
       allocating/deallocating new links.  The smaller the number, the less
       wasted space and the more rapid freeing of older data.
    */
    #define LINKCELLS 57