Using enumerate on a generator to parse text


I'm trying to iterate over a text file (containing several stories) and return a list of lists where each list is a new story.

  • read_lines_in_text(fname) is a generator that I want to iterate over to read each line in the text file. This must remain a generator.

  • find_title(fname) is a function that must be used and returns a list of the lines in the text where a title appears (and therefore signals the start of a new story).

The code I have written below does the job, but I think it is not a great solution.

newdict = {}
story = []
list_of_stories = []

for idx, line in enumerate(read_lines_in_text(fname)):
    if line in find_title(fname):
        newdict[idx] = line

for idx, line in enumerate(read_lines_in_text(fname)):
    if idx >= list(newdict.keys())[0]:
        if idx in newdict:
            list_of_stories.append(story)
            story = []
            story.append(line)
        else:
            story.append(line)

Given that I have the indexes of where each title occurs in the text, I want to have something like the following:

for lines between key i and key i+1 in newdict:
    append to story
list_of_stories.append(story)
story = []

Solution

  • You do not need to use indices at all. Just start a new story list whenever you have a new title, and append the previous one to list_of_stories:

    story = []
    list_of_stories = []
    titles = set(find_title(fname))
    
    for line in read_lines_in_text(fname):
        if line in titles:
            # start a new story, append the previous
            if story:
                list_of_stories.append(story)
            story = [line]
        elif story:  # a story has been started
            story.append(line)
    
    # handle the last story
    if story:
        list_of_stories.append(story)
    

    When using a generator function, you really want to avoid treating it as a random access sequence with index numbers.

    Note that we also avoid reading fname more than once just to get the titles: find_title(fname) is called a single time, outside the loop, rather than once per line as in the question. Its result is stored in the titles variable as a set for fast membership testing.
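
For experimenting without the original file, the same loop can be wrapped in a function that accepts any iterable of lines. This is only a sketch under assumptions: split_stories, sample_lines, and the sample text are hypothetical stand-ins, with sample_lines playing the role of read_lines_in_text(fname) and a plain list of strings playing the role of find_title(fname).

```python
def split_stories(lines, titles):
    """Group an iterable of lines into stories, each starting at a title line."""
    titles = set(titles)  # set membership tests are O(1) on average
    stories = []
    story = []
    for line in lines:
        if line in titles:
            # A new title: store the previous story (if any) and start a new one.
            if story:
                stories.append(story)
            story = [line]
        elif story:
            # Lines before the first title are skipped; others join the current story.
            story.append(line)
    # The last story has no following title to trigger the append.
    if story:
        stories.append(story)
    return stories


def sample_lines():
    """Hypothetical stand-in for read_lines_in_text(fname); yields one line at a time."""
    yield from [
        "preamble",
        "THE FOX",
        "A fox ran.",
        "THE CROW",
        "A crow flew.",
        "It landed.",
    ]


# The title list is a hypothetical stand-in for what find_title(fname) returns.
stories = split_stories(sample_lines(), ["THE FOX", "THE CROW"])
print(stories)
```

Because the function only iterates over lines once, front to back, it works with a generator just as well as with a list, and it never needs an index.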