Tags: python, generator, python-itertools

itertools.takewhile within a generator function - why is it evaluated once only?


I have a text file like this:

11
2
3
4

11

111

Using Python 2.7, I want to turn it into a list of lists of lines, where line breaks divide items in the inner list and empty lines divide items in the outer list. Like so:

[["11","2","3","4"],["11"],["111"]]

And for this purpose, I wrote a generator function that would yield the inner lists one at a time once passed an open file object:

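def readParag(fileObj):
    currentParag = []
    for line in fileObj:
        stripped = line.rstrip()
        # Non-empty line: add it to the paragraph being built.
        if len(stripped) > 0: currentParag.append(stripped)
        # Empty line: emit the finished paragraph and start a new one.
        elif len(currentParag) > 0:
            yield currentParag
            currentParag = []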

That works fine, and I can call it from within a list comprehension, producing the desired result. However, it subsequently occurred to me that I might be able to do the same thing more concisely using itertools.takewhile (with a view to rewriting the generator function as a generator expression, but we'll leave that for now). This is what I tried:

from itertools import takewhile    
def readParag(fileObj):
    yield [ln.rstrip() for ln in takewhile(lambda line: line != "\n", fileObj)]

In this case, the resulting generator yields only one result (the expected first one, i.e. ["11","2","3","4"]). I had hoped that calling its next method again would cause it to evaluate takewhile(lambda line: line != "\n", fileObj) again on the remainder of the file, thus leading it to yield another list. But no: I got a StopIteration instead. So I surmised that the takewhile expression was being evaluated once only, at the time when the generator object was created, and not each time I called the resultant generator object's next method.

This supposition made me wonder what would happen if I called the generator function again. The result was that it created a new generator object that also yielded a single result (the expected second one, i.e. ["11"]) before throwing a StopIteration back at me. So in fact, writing this as a generator function effectively gives the same result as if I'd written it as an ordinary function and returned the list instead of yielding it.

I guess I could solve this problem by creating my own class to use instead of a generator (as in John Millikin's answer to this question). But the point is that I was hoping to write something more concise than my original generator function (possibly even a generator expression). Can somebody tell me what I'm doing wrong, and how to get it right?


Solution

  • What you're trying to do is a perfect job for groupby:

    from itertools import groupby
    
    def read_parag(filename):
        with open(filename) as f:
            for k,g in groupby((line.strip() for line in f), bool):
                if k:
                    yield list(g)
    

    which will give:

    >>> list(read_parag('myfile.txt'))
    [['11', '2', '3', '4'], ['11'], ['111']]
    

    Or in one line:

    [list(g) for k,g in groupby((line.strip() for line in open('myfile.txt')), bool) if k]
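
As for why the takewhile version stops after one list: the body of a generator function runs from one yield to the next, and that body contains a single yield with no loop around it. The first call to next runs takewhile and yields the list; the second call resumes after the yield, reaches the end of the function, and raises StopIteration. Calling the function again builds a new generator over the same, partly consumed file object, which is why it picks up at the second paragraph. If you want to keep takewhile, it has to sit inside a loop. A minimal sketch follows; the names read_parag_once, read_parag_loop and sample are made up for illustration, and a list of lines stands in for the open file:

```python
from itertools import takewhile

def read_parag_once(lines):
    # The version from the question: a single yield with no loop around it,
    # so the generator is exhausted after its first result.
    yield [ln.rstrip() for ln in takewhile(lambda line: line != "\n", lines)]

def read_parag_loop(lines):
    # Hypothetical fix: run takewhile once per paragraph.
    it = iter(lines)
    for first in it:                  # the loop ends cleanly at end of input
        if not first.strip():
            continue                  # skip extra blank lines
        # takewhile also consumes (and discards) the blank separator line.
        rest = takewhile(lambda line: line.strip(), it)
        yield [first.rstrip()] + [ln.rstrip() for ln in rest]

# Stand-in for the lines of the text file in the question.
sample = ["11\n", "2\n", "3\n", "4\n", "\n", "11\n", "\n", "111\n"]

once = list(read_parag_once(iter(sample)))
looped = list(read_parag_loop(sample))
print(once)
print(looped)
```

The outer for loop pulls the first line of each paragraph so that the end of the input can be told apart from an empty paragraph; takewhile on its own returns an empty result in both cases. This is not shorter than the original generator, so groupby is still the more concise option.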