python, python-itertools

Can I yield from an instance method?


Is it okay to use the yield statement in an instance method of a class? For example,

# Similar to itertools.islice
class Nth(object):
    def __init__(self, n):
        self.n = n
        self.i = 0
        self.nout = 0

    def itervalues(self, x):
        for xi in x:
            self.i += 1
            if self.i == self.n:
                self.i = 0
                self.nout += 1
                yield self.nout, xi

Python doesn't complain about this, and simple cases seem to work. However, I've only seen examples that use yield in regular functions, not in methods.

I start having problems when I try to use it with itertools functions. For example, suppose I have two large data streams X and Y that are stored across multiple files, and I want to compute their sum and difference with only one loop through the data. I could use itertools.tee and itertools.izip, as in the following diagram:

[Figure: data flow diagram]

In code it would be something like this (sorry, it's long)

from itertools import izip_longest, izip, tee
import random

def add(x,y):
    for xi,yi in izip(x,y):
        yield xi + yi

def sub(x,y):
    for xi,yi in izip(x,y):
        yield xi - yi

class NthSumDiff(object):
    def __init__(self, n):
        self.nthsum = Nth(n)
        self.nthdiff = Nth(n)

    def itervalues(self, x, y):
        xadd, xsub = tee(x)
        yadd, ysub = tee(y)
        gen_sum = self.nthsum.itervalues(add(xadd, yadd))
        gen_diff = self.nthdiff.itervalues(sub(xsub, ysub))
        # Have to use izip_longest here, but why?
        #for (i,nthsum), (j,nthdiff) in izip_longest(gen_sum, gen_diff):
        for (i,nthsum), (j,nthdiff) in izip(gen_sum, gen_diff):
            assert i==j, "sum row %d != diff row %d" % (i,j)
            yield nthsum, nthdiff

nskip = 12
ns = Nth(nskip)
nd = Nth(nskip)
nsd = NthSumDiff(nskip)
nfiles = 10
for i in range(nfiles):
    # Generate some data.
    # If the block length is a multiple of nskip there's no problem.
    #n = random.randint(5000, 10000) * nskip
    n = random.randint(50000, 100000)
    print 'file %d n=%d' % (i, n)
    x = range(n)
    y = range(100,n+100)
    # Independent processing is no problem but requires two loops.
    for i, nthsum in ns.itervalues(add(x,y)):
        pass
    for j, nthdiff in nd.itervalues(sub(x,y)):
        pass
    assert i==j
    # Trying to do both with one loop causes problems.
    for nthsum, nthdiff in nsd.itervalues(x,y):
        # If izip_longest is necessary, why don't I ever get a fillvalue?
        assert nthsum is not None
        assert nthdiff is not None
    # After each block of data the two iterators should have the same state.
    assert nsd.nthsum.nout == nsd.nthdiff.nout, \
           "sum nout %d != diff nout %d" % (nsd.nthsum.nout, nsd.nthdiff.nout)

But this fails unless I swap itertools.izip for itertools.izip_longest, even though the iterators have the same length. The last assert is the one that fails, with output like

file 0 n=58581
file 1 n=87978
Traceback (most recent call last):
  File "test.py", line 71, in <module>
    "sum nout %d != diff nout %d" % (nsd.nthsum.nout, nsd.nthdiff.nout)
AssertionError: sum nout 12213 != diff nout 12212 

Edit: I guess it's not obvious from the example I wrote, but the input data X and Y are only available in blocks (in my real problem they're chunked in files). This is important because I need to maintain state between blocks. In the toy example above, this means Nth needs to yield the equivalent of

>>> x1 = range(0,10)
>>> x2 = range(10,20)
>>> (x1 + x2)[::3]
[0, 3, 6, 9, 12, 15, 18]

NOT the equivalent of

>>> x1[::3] + x2[::3]
[0, 3, 6, 9, 10, 13, 16, 19]

I could use itertools.chain to join the blocks ahead of time and then make one call to Nth.itervalues, but I'd like to understand what's wrong with maintaining state in the Nth class between calls (my real app is image processing involving more saved state, not simple Nth/add/subtract).
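As a point of comparison, here is a minimal sketch of the two approaches mentioned above: joining the blocks with itertools.chain versus calling Nth.itervalues once per block on a shared instance. It is written for Python 3 (an assumption; the question's code is Python 2), and the variable names `blocks`, `chained`, and `per_block` are only illustrative. When each per-block generator is run to exhaustion, both approaches should pick the same elements.

```python
from itertools import chain


class Nth:
    """Python 3 port of the Nth class from the question."""

    def __init__(self, n):
        self.n = n
        self.i = 0
        self.nout = 0

    def itervalues(self, x):
        for xi in x:
            self.i += 1
            if self.i == self.n:
                self.i = 0
                self.nout += 1
                yield self.nout, xi


blocks = [range(0, 10), range(10, 20)]

# Option 1: join the blocks first, then make a single call.
chained = [xi for _, xi in Nth(3).itervalues(chain.from_iterable(blocks))]

# Option 2: one call per block on a shared instance.
# extend() drains each generator completely, so self.i also
# accounts for the leftover items at the end of every block.
nth = Nth(3)
per_block = []
for block in blocks:
    per_block.extend(xi for _, xi in nth.itervalues(block))

print(chained)
print(per_block)
```

The per-block version only matches because every generator is drained before the next block starts; the state kept on the instance is correct only once the trailing items of a block have been consumed.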

I don't understand how my Nth instances end up in different states when their lengths are the same. For example, if I give izip two strings of equal length

>>> [''.join(x) for x in izip('ABCD','abcd')]
['Aa', 'Bb', 'Cc', 'Dd']

I get a result of the same length; how come my Nth.itervalues generators seem to be getting unequal numbers of next() calls even though each one yields the same number of results?


Solution

  • Condensing the discussion: there's nothing wrong with using yield in an instance method per se. You get into trouble with izip if the instance state changes after the last yield, because izip stops calling next() on its arguments as soon as any one of them is exhausted. The first generator to be exhausted runs to completion, but the generators after it in the argument list are never resumed past their last yield. A clearer example might be

    from itertools import izip
    
    class Three(object):
        def __init__(self):
            self.status = 'init'
    
        def run(self):
            self.status = 'running'
            yield 1
            yield 2
            yield 3
            self.status = 'done'
            return  # ends the generator; an explicit raise StopIteration is not needed
    
    it = Three()
    for x in it.run():
        assert it.status == 'running'
    assert it.status == 'done'
    
    it1, it2 = Three(), Three()
    for x, y in izip(it1.run(), it2.run()):
        pass
    assert it1.status == 'done'
    assert it2.status == 'done', "Expected status=done, got status=%s." % it2.status
    

    which hits the last assertion,

    AssertionError: Expected status=done, got status=running.
    

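    In the original question, Nth.itervalues keeps consuming input after its last yield: when a block's length is not a multiple of n, the leftover items still advance self.i. izip asks gen_sum for its next value first, so gen_sum consumes those leftover items and raises StopIteration, and izip then stops without calling next() on gen_diff again. The two Nth instances therefore start the next block with different values of self.i, and the sum and difference streams get out of sync. Using izip_longest works because it keeps calling next() until every iterator is exhausted, and since both generators finish in the same round it never produces a fillvalue. A clearer solution might be to refactor to avoid changing state after the last yield.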
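One possible way to do that refactoring, sketched under assumptions rather than taken from the original discussion: keep the counting state in a plain method (here a hypothetical `push`) and let a single generator function (the hypothetical `nth_sum_diff`) drive both counters from one loop. The example uses Python 3, where zip plays the role of izip. Because both counters advance inside the same loop iteration, neither can run ahead of the other, and no state changes inside a generator after its last yield.

```python
class Nth:
    """Push-style variant of Nth: state changes only inside push()."""

    def __init__(self, n):
        self.n = n
        self.i = 0
        self.nout = 0

    def push(self, value):
        """Consume one value; return (nout, value) on every n-th call, else None."""
        self.i += 1
        if self.i == self.n:
            self.i = 0
            self.nout += 1
            return self.nout, value
        return None


def nth_sum_diff(nth_sum, nth_diff, x, y):
    """Yield (sum, diff) for every n-th pair, advancing both counters in lockstep."""
    for xi, yi in zip(x, y):
        s = nth_sum.push(xi + yi)
        d = nth_diff.push(xi - yi)
        if s is not None:
            # Both counters were pushed the same number of times,
            # so d is not None here either (assuming equal n).
            yield s[1], d[1]


ns, nd = Nth(12), Nth(12)
results = []
# Two blocks whose lengths are not multiples of 12.
for length in (50, 31):
    x = range(length)
    y = range(100, length + 100)
    results.extend(nth_sum_diff(ns, nd, x, y))

print(len(results), ns.nout, nd.nout, ns.i, nd.i)
```

This sketch assumes both Nth instances use the same n; with different values the two streams would need separate handling.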