Is it okay to use the yield statement in an instance method of a class? For example,
# Similar to itertools.islice
class Nth(object):
def __init__(self, n):
self.n = n
self.i = 0
self.nout = 0
def itervalues(self, x):
for xi in x:
self.i += 1
if self.i == self.n:
self.i = 0
self.nout += 1
yield self.nout, xi
Python doesn't complain about this, and simple cases seem to work. However, I've only seen examples with yield from regular functions.
I start having problems when I try to use it with itertools functions. For example, suppose I have two large data streams X and Y that are stored across multiple files, and I want to compute their sum and difference with only one loop through the data. I could use itertools.tee
and itertools.izip
like in the following diagram
In code it would be something like this (sorry, it's long)
from itertools import izip_longest, izip, tee
import random
def add(x,y):
for xi,yi in izip(x,y):
yield xi + yi
def sub(x,y):
for xi,yi in izip(x,y):
yield xi - yi
class NthSumDiff(object):
def __init__(self, n):
self.nthsum = Nth(n)
self.nthdiff = Nth(n)
def itervalues(self, x, y):
xadd, xsub = tee(x)
yadd, ysub = tee(y)
gen_sum = self.nthsum.itervalues(add(xadd, yadd))
gen_diff = self.nthdiff.itervalues(sub(xsub, ysub))
# Have to use izip_longest here, but why?
#for (i,nthsum), (j,nthdiff) in izip_longest(gen_sum, gen_diff):
for (i,nthsum), (j,nthdiff) in izip(gen_sum, gen_diff):
assert i==j, "sum row %d != diff row %d" % (i,j)
yield nthsum, nthdiff
nskip = 12
ns = Nth(nskip)
nd = Nth(nskip)
nsd = NthSumDiff(nskip)
nfiles = 10
for i in range(nfiles):
# Generate some data.
# If the block length is a multiple of nskip there's no problem.
#n = random.randint(5000, 10000) * nskip
n = random.randint(50000, 100000)
print 'file %d n=%d' % (i, n)
x = range(n)
y = range(100,n+100)
# Independent processing is no problem but requires two loops.
for i, nthsum in ns.itervalues(add(x,y)):
pass
for j, nthdiff in nd.itervalues(sub(x,y)):
pass
assert i==j
# Trying to do both with one loops causes problems.
for nthsum, nthdiff in nsd.itervalues(x,y):
# If izip_longest is necessary, why don't I ever get a fillvalue?
assert nthsum is not None
assert nthdiff is not None
# After each block of data the two iterators should have the same state.
assert nsd.nthsum.nout == nsd.nthdiff.nout, \
"sum nout %d != diff nout %d" % (nsd.nthsum.nout, nsd.nthdiff.nout)
But this fails unless I swap itertools.izip
out for itertools.izip_longest
even though the iterators have the same length. It's the last assert
that gets hit, with output like
file 0 n=58581
file 1 n=87978
Traceback (most recent call last):
File "test.py", line 71, in <module>
"sum nout %d != diff nout %d" % (nsd.nthsum.nout, nsd.nthdiff.nout)
AssertionError: sum nout 12213 != diff nout 12212
Edit: I guess it's not obvious from the example I wrote, but the input data X and Y are only available in blocks (in my real problem they're chunked in files). This is important because I need to maintain state between blocks. In the toy example above, this means Nth
needs to yield the equivalent of
>>> x1 = range(0,10)
>>> x2 = range(10,20)
>>> (x1 + x2)[::3]
[0, 3, 6, 9, 12, 15, 18]
NOT the equivalent of
>>> x1[::3] + x2[::3]
[0, 3, 6, 9, 10, 13, 16, 19]
I could use itertools.chain
to join the blocks ahead of time and then make one call to Nth.itervalues
, but I'd like to understand what's wrong with maintaining state in the Nth
class between calls (my real app is image processing involving more saved state, not simple Nth/add/subtract).
I don't understand how my Nth
instances end up in different states when their lengths are the same. For example, if I give izip
two strings of equal length
>>> [''.join(x) for x in izip('ABCD','abcd')]
['Aa', 'Bb', 'Cc', 'Dd']
I get a result of the same length; how come my Nth.itervalues
generators seem to be getting unequal numbers of next()
calls even though each one yields the same number of results?
Condensing the discussion, there's nothing wrong with using yield
in an instance method per se. You get into trouble with izip
if the instance state changes after the last yield
because izip
stops calling next()
on its arguments once any of them stops yielding results. A clearer example might be
from itertools import izip
class Three(object):
def __init__(self):
self.status = 'init'
def run(self):
self.status = 'running'
yield 1
yield 2
yield 3
self.status = 'done'
raise StopIteration()
it = Three()
for x in it.run():
assert it.status == 'running'
assert it.status == 'done'
it1, it2 = Three(), Three()
for x, y in izip(it1.run(), it2.run()):
pass
assert it1.status == 'done'
assert it2.status == 'done', "Expected status=done, got status=%s." % it2.status
which hits the last assertion,
AssertionError: Expected status=done, got status=running.
In the original question, the Nth
class can consume input data after its last yield
, so the sum and difference streams can get out of sync with izip
. Using izip_longest
would work since it will try to exhaust each iterator. A clearer solution might be to refactor to avoid changing state after the last yield.