
iterating over a single list in parallel in python


The objective is to do calculations on a single iterator in parallel, using the builtin sum & map functions concurrently. Perhaps itertools (or something like it) can replace classic for loops for analyzing (LARGE) data that arrives via an iterator...

In one simple example case I want to calculate ilen, sum_x & sum_x_sq:

ilen,sum_x,sum_x_sq=iterlen(iter),sum(iter),sum(map(lambda x:x*x, iter))

But without converting the (large) iter to a list (as with iter=list(iter))

n.b. Do this using sum & map and without for loops, maybe using the itertools and/or threading modules?
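(For context, the one-liner above cannot work as written because an iterator can only be consumed once; a quick illustration:)

```python
it = iter([1, 2, 3])
total = sum(it)                           # drains the iterator
again = sum(it)                           # iterator is now empty
squares = sum(map(lambda x: x * x, it))   # still empty
# total == 6, but again == 0 and squares == 0
```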

import random

def example_large_data(n=100000000, mean=0, std_dev=1):
    for i in range(n):
        yield random.gauss(mean, std_dev)

-- edit --

Being VERY specific: I was taking a good look at itertools hoping that there was a dual function like map that could do it. For example: len_x,sum_x,sum_x_sq=itertools.iterfork(iter_x,iterlen,sum,sum_sq)

If I was to be very very specific: I am looking for just one answer, python source code for the "iterfork" procedure.


Solution

  • You can use itertools.tee to turn your single iterator into three iterators which you can pass to your three functions.

    iter0, iter1, iter2 = itertools.tee(input_iter, 3)
    ilen, sum_x, sum_x_sq = count(iter0), sum(iter1), sum(map(lambda x: x*x, iter2))
    

    That will work, but the builtin function sum (and map in Python 2) is not implemented in a way that supports parallel iteration. The first function you call will consume its iterator completely, then the second one will consume the second iterator, then the third function will consume the third iterator. Since tee has to buffer every value that one of its output iterators has seen but the others have not yet consumed, consuming the iterators sequentially like this is essentially the same as creating a list from the iterator and passing it to each function.
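To see that sequential behavior concretely, here is a small run on toy data (results are correct, but tee's buffer grows to hold the entire stream before the second and third functions ever run):

```python
import itertools

data = iter(range(5))   # stand-in for the large stream
iter0, iter1, iter2 = itertools.tee(data, 3)

# Each call fully drains its tee branch before the next call starts,
# so tee must buffer the whole stream for the later branches.
ilen = sum(1 for _ in iter0)
sum_x = sum(iter1)
sum_x_sq = sum(map(lambda x: x * x, iter2))
# ilen == 5, sum_x == 10, sum_x_sq == 30
```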

    Now, if you use generator functions that consume only a single value from their input for each value they output, you might be able to make parallel iteration work using zip. In Python 3, map and zip are both generators. The question is how to make sum into a generator.

    I think you can get pretty much what you want by using itertools.accumulate (which was added in Python 3.2). It is a generator that yields a running sum of its input. Here's how you could make it work for your problem (I'm assuming your count function was supposed to be an iterator-friendly version of len):

    import itertools

    iter0, iter1, iter2 = itertools.tee(input_iter, 3)
    
    len_gen = itertools.accumulate(map(lambda x: 1, iter0))
    sum_gen = itertools.accumulate(iter1)
    sum_sq_gen = itertools.accumulate(map(lambda x: x*x, iter2))
    
    parallel_gen = zip(len_gen, sum_gen, sum_sq_gen)  # zip is a generator in Python 3
    
    for ilen, sum_x, sum_x_sq in parallel_gen:
        pass    # the generators do all the work, so there's nothing for us to do here
    
    # ilen, sum_x, sum_x_sq have the right values here!
    

    If you're using Python 2 rather than 3, you'll have to write your own accumulate generator function (there's a pure-Python equivalent in the itertools.accumulate documentation), and use itertools.imap and itertools.izip rather than the builtin map and zip functions.
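There is no "iterfork" in itertools, but the tee/accumulate/zip machinery above can be packaged into one. Below is a sketch of one possible implementation. Note it takes per-item functions whose results are summed, rather than whole-iterator reducers as in the question's proposed signature, because reducers like the builtin sum cannot be interleaved:

```python
import itertools

def iterfork(iterable, *fns):
    """Consume `iterable` once, returning sum(fn(x) for x in iterable) per fn."""
    # One tee branch per function; each accumulate generator keeps a
    # running total of fn(x). zip pulls one item from every branch per
    # step, so tee's buffer stays small instead of holding the stream.
    iters = itertools.tee(iterable, len(fns))
    gens = [itertools.accumulate(map(fn, it)) for fn, it in zip(fns, iters)]
    totals = (0,) * len(fns)   # result for an empty input
    for totals in zip(*gens):
        pass                   # the generators do all the work
    return totals

# Length, sum, and sum of squares in a single pass:
ilen, sum_x, sum_x_sq = iterfork(range(5),
                                 lambda x: 1,      # counts items
                                 lambda x: x,      # plain sum
                                 lambda x: x * x)  # sum of squares
# ilen == 5, sum_x == 10, sum_x_sq == 30
```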