
Chaining generators considered harmful?


I claim: Chaining generators in Python is memory-inefficient and renders them unusable for certain types of applications. If possible, please prove me wrong.

First, a very simple and straight-forward example without generators:

import gc

def cocktail_objects():
    # find all Cocktail objects currently tracked by the garbage collector
    return filter(lambda obj: isinstance(obj, Cocktail), gc.get_objects())

class Cocktail(object):
    def __init__(self, ingredients):
        # ingredients represents our object data, imagine some heavy arrays
        self.ingredients = ingredients
    def __str__(self):
        return self.ingredients
    def __repr__(self):
        return 'Cocktail(' + str(self) + ')'

def create(first_ingredient):
    return Cocktail(first_ingredient)

def with_ingredient(cocktail, ingredient):
    # this could be some data transformation function
    return Cocktail(cocktail.ingredients + ' and ' + ingredient)

first_ingredients = ['rum', 'vodka']

print 'using iterative style:' 
for ingredient in first_ingredients:
    cocktail = create(ingredient)
    cocktail = with_ingredient(cocktail, 'coke')
    cocktail = with_ingredient(cocktail, 'limes')
    print cocktail
    print cocktail_objects()

This prints as expected:

rum and coke and limes
[Cocktail(rum and coke and limes)]
vodka and coke and limes
[Cocktail(vodka and coke and limes)]

Now let's use iterator objects to make the cocktail transformations more easily composable:

class create_iter(object):
    def __init__(self, first_ingredients):
        self.first_ingredients = first_ingredients
        self.i = 0

    def __iter__(self):
        return self

    def next(self):
        try:
            ingredient = self.first_ingredients[self.i]
        except IndexError:
            raise StopIteration
        else:
            self.i += 1
            return create(ingredient)

class with_ingredient_iter(object):
    def __init__(self, cocktails_iter, ingredient):
        self.cocktails_iter = cocktails_iter
        self.ingredient = ingredient

    def __iter__(self):
        return self

    def next(self):
        cocktail = next(self.cocktails_iter)
        return with_ingredient(cocktail, self.ingredient)

print 'using iterators:'
base = create_iter(first_ingredients)
with_coke = with_ingredient_iter(base, 'coke')
with_coke_and_limes = with_ingredient_iter(with_coke, 'limes')
for cocktail in with_coke_and_limes:
    print cocktail
    print cocktail_objects() 

The output is identical to before.

Finally, let's replace the iterators with generators to get rid of the boilerplate:

def create_gen(first_ingredients):
    for ingredient in first_ingredients:
        yield create(ingredient)

def with_ingredient_gen(cocktails_gen, ingredient):
    for cocktail in cocktails_gen:
        yield with_ingredient(cocktail, ingredient)

print 'using generators:'
base = create_gen(first_ingredients)
with_coke = with_ingredient_gen(base, 'coke')
with_coke_and_limes = with_ingredient_gen(with_coke, 'limes')

for cocktail in with_coke_and_limes:
    print cocktail
    print cocktail_objects()

This however prints:

rum and coke and limes
[Cocktail(rum), Cocktail(rum and coke), Cocktail(rum and coke and limes)]
vodka and coke and limes
[Cocktail(vodka), Cocktail(vodka and coke), Cocktail(vodka and coke and limes)]

This means that in a chain of generators, all currently yielded objects in the chain stay in memory and don't get released, even though the ones at earlier positions in the chain are no longer needed. The result: higher memory consumption than necessary.

Now, the question is: Why do the generators hold on to the objects that they are yielding until the next iteration starts? Obviously the objects aren't needed anymore in the generators and the references to them could be released.

I'm using generators in one of my projects to transform heavy data (numpy arrays of hundreds of megabytes) in a kind of pipeline. But as you can see this is very inefficient memory-wise. I am using Python 2.7. If this is a behaviour that got fixed in Python 3, please tell me. Otherwise, does this qualify for a bug report? And most importantly, are there any work-arounds except rewriting as shown?


Work-around 1:

print 'using imap:'
from itertools import imap
base = imap(lambda ingredient: create(ingredient), first_ingredients)
with_coke = imap(lambda cocktail: with_ingredient(cocktail, 'coke'), base)
with_coke_and_limes = imap(lambda cocktail: with_ingredient(cocktail, 'limes'), with_coke)

for cocktail in with_coke_and_limes:
    print cocktail
    print gc.collect()
    print cocktail_objects()

Obviously this is only usable if no state needs to be kept between the "yields". That is the case in these examples.
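
For instance (a contrived example, not from my real pipeline), a transformation that keeps a running list of everything mixed so far has no direct imap equivalent:

def accumulate_gen(cocktails_gen):
    # stateful: 'seen' must survive across yields, and a per-element
    # imap callback has no obvious place to keep it
    seen = []
    for cocktail in cocktails_gen:
        seen.append(cocktail.ingredients)
        yield Cocktail(' + '.join(seen))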

Preliminary conclusion: If you use iterator classes, then you decide what state you want to keep. If you use generators, Python implicitly decides what state to keep. If you use itertools.imap you cannot keep any state.
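
A related sketch (from poking at CPython 2.7, so take it with a grain of salt): a generator that never binds the cocktails to local variables seems to behave like imap, because the yielded value is popped off the frame's stack before the frame is suspended, so the suspended frame references no cocktail at all:

def with_ingredient_gen2(cocktails_iter, ingredient):
    # the upstream cocktail is consumed directly as an argument and the
    # result is yielded without ever being bound to a local
    while True:
        yield with_ingredient(next(cocktails_iter), ingredient)

base = create_gen(first_ingredients)
with_coke = with_ingredient_gen2(base, 'coke')
with_coke_and_limes = with_ingredient_gen2(with_coke, 'limes')

The StopIteration raised by next() at the end simply propagates and ends the chain under Python 2; Python 3's PEP 479 would require handling it explicitly.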


Solution

  • Your with_coke_and_limes yields at a certain point in its execution. At that point, the function has a local variable called cocktail (from its for loop) that refers to the "intermediate" cocktail from the next step up in the generator nesting (i.e., "rum and coke"). Just because the generator yields at that point does not mean it can throw away that object: the execution of with_ingredient_gen is suspended there, and the local variable cocktail still exists. The function might need to refer to it later, after it resumes.

    There's nothing that says the yield has to be the last thing in your for loop, or that there has to be only one yield. You could have written with_ingredient_gen like this:

    def with_ingredient_gen(cocktails_gen, ingredient):
        for cocktail in cocktails_gen:
            yield with_ingredient(cocktail, ingredient)
            yield with_ingredient(cocktail, "another ingredient")
    

    If Python threw away cocktail after the first yield, what would it do when it resumed the generator on the next iteration and found it needed that cocktail object again for the second yield?

    The same applies to the other generators in the chain. Once you advance with_coke_and_limes to create a cocktail, with_coke and base are also activated and then suspended, and they have local variables referring to their own intermediate cocktails. Just as described above, these functions cannot delete the objects they refer to, since they might need them after resuming.

    The generator function has to have some sort of reference to an object in order to yield it, and if that reference lives in a local variable it is kept after the yield: the function is suspended immediately after yielding, and it can't know whether it will need the reference once it is resumed.
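
    You can see a suspended frame holding such a reference directly (gi_frame is a CPython implementation detail, so treat this as a diagnostic, not an API):

    def demo():
        x = Cocktail('gin')          # bound to a local before the yield
        yield x

    g = demo()
    g.next()
    print g.gi_frame.f_locals        # {'x': Cocktail(gin)} -- kept while suspended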

    Note that the only reason you didn't see the intermediate objects in your first example is because you overwrote the same local variable with each successive cocktail, allowing the earlier cocktail objects to be released. If in your first code snippet you do this instead:

    for ingredient in first_ingredients:
        cocktail = create(ingredient)
        cocktail2 = with_ingredient(cocktail, 'coke')
        cocktail3 = with_ingredient(cocktail2, 'limes')
        print cocktail3
        print cocktail_objects()
    

    ...then you will see all three intermediate cocktails printed in that case as well, because each now has a separate local variable referring to it. Your generator version splits each of these intermediate variables into separate functions, so you can't overwrite the "parent" cocktail with the "derived" cocktail.

    You are right that this can cause a problem if you have a deeply nested sequence of generators, each one of which creates large objects in memory and stores them in local variables. However, that is not a common situation. In such a situation, you have a couple of options. One is to perform the operations in a "flat" iterative style as in your first example.

    Another option is to write your intermediate generators so that they don't actually create the large objects, but only "stack" the information needed to do so. For instance, in your example, if you don't want the intermediate Cocktail objects, don't create them. Instead of having each generator create a Cocktail and then having the next generator extract the previous cocktail's ingredients, have the generators pass on just the ingredients and have one final generator that combines the stacked ingredients and creates just one cocktail at the end.
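
    A sketch of that idea for the cocktail example (the generator names are invented here):

    def base_ingredients_gen(first_ingredients):
        for ingredient in first_ingredients:
            yield [ingredient]                   # a cheap list of names, no Cocktail yet

    def add_ingredient_gen(ingredients_gen, ingredient):
        for ingredients in ingredients_gen:
            yield ingredients + [ingredient]     # still only stacking names

    def make_cocktails_gen(ingredients_gen):
        for ingredients in ingredients_gen:
            yield Cocktail(' and '.join(ingredients))   # the only place a Cocktail is built

    The suspended frames in this chain hold only small lists of strings, and exactly one Cocktail exists per iteration.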

    It's hard to say exactly how to do this for your real application, but it may be possible. For instance, if your generators working on numpy arrays are doing things like add this, subtract that, transpose, etc., you can pass on "deltas" that describe what to do without actually doing it. Instead of having an intermediate generator, say, multiply an array by 3 and yield the array, have it yield some kind of indicator like "*3" (or possibly even a function doing the multiplication). Then your last generator can iterate over these "instructions" and perform the operations all in one place.