The objective is to run several calculations over a single iterator in parallel, using the builtin sum and map functions concurrently, perhaps with something like itertools instead of classic for loops, in order to analyze (LARGE) data that arrives via an iterator.
In one simple example I want to calculate ilen, sum_x and sum_x_sq:
ilen, sum_x, sum_x_sq = iterlen(iter), sum(iter), sum(map(lambda x: x*x, iter))
But without converting the (large) iter to a list (as with iter=list(iter)).
N.B. Do this using sum and map and without for loops, maybe using the itertools and/or threading modules?
import random
def example_large_data(n=100000000, mean=0, std_dev=1):
    for i in range(n):
        yield random.gauss(mean, std_dev)
-- edit --
Being VERY specific: I was taking a good look at itertools, hoping that there was a dual function like map that could do it. For example: len_x, sum_x, sum_x_sq = itertools.iterfork(iter_x, iterlen, sum, sum_sq)
If I were to be very, very specific: I am looking for just one answer, Python source code for the "iterfork" procedure.
You can use itertools.tee to turn your single iterator into three iterators, which you can pass to your three functions.
iter0, iter1, iter2 = itertools.tee(input_iter, 3)
ilen, sum_x, sum_x_sq = iterlen(iter0), sum(iter1), sum(map(lambda x: x*x, iter2))
That will work, but the builtin function sum (and map in Python 2) is not implemented in a way that supports parallel iteration. The first function you call will consume its iterator completely, then the second one will consume the second iterator, then the third function will consume the third iterator. Since tee has to store every value that one of its output iterators has seen but the others have not, this is essentially the same as creating a list from the iterator and passing it to each function.
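To make the buffering problem concrete, here is a small sketch that logs when the source actually produces values. The traced helper and the log list are hypothetical names introduced only for this illustration; the point is that every value is pulled from the source before the second consumer even starts, so tee has to hold all of them.

```python
import itertools

def traced(values, log):
    # Hypothetical helper: record each value as the source produces it.
    for v in values:
        log.append(("produced", v))
        yield v

log = []
a, b = itertools.tee(traced(range(3), log), 2)

total = sum(a)                      # drains the source completely
log.append(("first sum done", total))
total_sq = sum(x * x for x in b)    # served entirely from tee's internal buffer

print(log)
```

All three "produced" entries appear in the log before "first sum done", which is the list-like behavior described above.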
Now, if you use generator functions that consume only a single value from their input for each value they output, you might be able to make parallel iteration work using zip. In Python 3, map and zip both return lazy iterators that behave like generators. The question is how to make sum into a generator.
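As a quick check of the laziness claim, the following sketch (Python 3 only) zips two map objects and records the order in which the mapped functions are called. The tag helper and the calls list are hypothetical names used just for this demonstration.

```python
calls = []

def tag(name):
    # Hypothetical helper: build a function that records each call and passes the value through.
    def record(x):
        calls.append((name, x))
        return x
    return record

pairs = zip(map(tag("a"), [1, 2]), map(tag("b"), [1, 2]))
calls_before = list(calls)   # nothing has run yet: map and zip are lazy

first = next(pairs)          # pulls exactly one value through each map
calls_after_one = list(calls)
```

After the single next call, each map has been advanced by exactly one item, which is the lockstep behavior that parallel iteration needs.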
I think you can get pretty much what you want by using itertools.accumulate (which was added in Python 3.2). It works like a generator that yields a running sum of its input. Here's how you could make it work for your problem (I'm assuming your count function, iterlen in the question, was supposed to be an iterator-friendly version of len):
import itertools

iter0, iter1, iter2 = itertools.tee(input_iter, 3)
len_gen = itertools.accumulate(map(lambda x: 1, iter0))
sum_gen = itertools.accumulate(iter1)
sum_sq_gen = itertools.accumulate(map(lambda x: x*x, iter2))
parallel_gen = zip(len_gen, sum_gen, sum_sq_gen)  # zip is lazy in Python 3
for ilen, sum_x, sum_x_sq in parallel_gen:
    pass  # the generators do all the work, so there's nothing for us to do here
# ilen, sum_x, sum_x_sq have the right values here (provided input_iter was not empty)
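If you want something closer to the "iterfork" you asked about, the same idea can be wrapped in a function. This is only a sketch for Python 3.2+: parallel_sums is a hypothetical name, not anything in itertools, and it handles only the "sum of a mapped value" case rather than arbitrary consumer functions. It uses collections.deque with maxlen=1 to drain the zipped stream and keep just the final tuple, which also avoids the explicit for loop.

```python
import itertools
from collections import deque

def parallel_sums(iterable, *mappers):
    """Return one sum per mapper, computed in a single lockstep pass (hypothetical helper)."""
    branches = itertools.tee(iterable, len(mappers))
    # One running total per mapper; zip advances all of them together,
    # so tee never needs to buffer more than one item.
    running = [itertools.accumulate(map(f, b)) for f, b in zip(mappers, branches)]
    last = deque(zip(*running), maxlen=1)   # keep only the final tuple of totals
    return last[0] if last else tuple(0 for _ in mappers)

ilen, sum_x, sum_x_sq = parallel_sums(
    iter([1, 2, 3]),
    lambda x: 1,       # counts items
    lambda x: x,       # plain sum
    lambda x: x * x,   # sum of squares
)
```

Returning zeros for an empty input is a design choice made here for convenience; raising an error would be just as reasonable.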
If you're using Python 2 rather than 3, you'll have to write your own accumulate generator function (there's a pure Python implementation in the docs I linked above), and use itertools.imap and itertools.izip rather than the builtin map and zip functions.
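For reference, a hand-written accumulate might look roughly like the sketch below. It is modeled on the general shape of the pure Python equivalent in the itertools documentation rather than copied from it, so treat the details as an assumption; it should run unchanged on both Python 2.6+ and Python 3.

```python
import operator

def accumulate(iterable, func=operator.add):
    """Yield running totals of iterable: a, a+b, a+b+c, ... (stand-in for itertools.accumulate)."""
    it = iter(iterable)
    try:
        total = next(it)
    except StopIteration:
        return              # empty input yields nothing
    yield total
    for element in it:
        total = func(total, element)
        yield total

running = list(accumulate([1, 2, 3, 4]))
```

On Python 2 you would pair this with itertools.imap and itertools.izip so that every stage stays lazy.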