Tags: python, python-2.x, stopiteration

StopIteration after defining xrange


I wrote the following code to read a text file in blocks of 4 lines and output a block only if its 2nd line is not composed of a single repeated character (that is, to filter out homogeneous reads). It is assumed (and was previously verified) that the 2nd line is always a string of 36 characters.

# filter out homogeneous reads

import sys
import collections
from collections import Counter

filename1 = sys.argv[1] # file to process

with open(filename1,'r') as input_file:
    for line1 in input_file:
        line2, line3, line4 = [next(input_file) for line in xrange(3)]
        c = Counter(line2).values() # count characters in line2
        c.sort(reverse=True) # sort values in descending order
        if c[0] < 36:
            print line1 + line2 + line3 + line4.rstrip()

However, I am getting a StopIteration error, as shown below. I would appreciate it if someone could tell me why.

$ python code.py test.file > testout.file
Traceback (most recent call last):
  File "code.py", line 11, in <module>
    line2, line3, line4 = [next(input_file) for line in xrange(3)]
StopIteration

Any help would be appreciated, especially of the kind that explains what is wrong with my specific code and how to fix it. Here is an example of input:

@1:1:1323:1032:Y
AGCAGCATTGTACAGGGCTATCATGGAATTCTCGGG
+1:1:1323:1032:Y
HHHBHHBHBHGBGGGH8HHHGGGGFHBHHHHBHHHH
@1:1:1610:1033:Y
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+1:1:1610:1033:Y
HHEHHHHHHHHHHHBGGD>GGD@G8GGGGDHBHH4C
@1:1:1679:1032:Y
CGGTGGATCACTCGGCTCGTGCGTCGATGAAGAACG

Solution

  • Your example input already shows the problem: it has 10 lines, which is not divisible by 4. When you read the very last block, you get line1 and line2, but by the time next() is called for line3 the input is exhausted, so next() raises StopIteration.

    It’s likely that you have the same issue in your full input file as well: The number of lines is simply not divisible by 4.

    There are a few ways to overcome this. The best is probably to fix your input: since you seem to expect complete blocks of four lines throughout, a file that doesn’t provide them has a content problem.
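One way to check the input before processing it, as a minimal sketch: count the lines and see whether anything is left over after the last complete block. The helper name count_leftover_lines and the sample data are made up for illustration; they are not part of the original script.

```python
def count_leftover_lines(lines, block_size=4):
    """Return how many lines remain after the last complete block.

    `lines` is any iterable of lines, e.g. an open file object.
    A result of 0 means the input splits cleanly into blocks.
    """
    total = sum(1 for _ in lines)
    return total % block_size


# Hypothetical in-memory sample standing in for a real file:
# one complete block followed by a partial block of two lines.
sample = ["@id1\n", "ACGT\n", "+id1\n", "HHHH\n", "@id2\n", "AAAA\n"]
leftover = count_leftover_lines(sample)
if leftover:
    print("input ends with an incomplete block of %d line(s)" % leftover)
```

With a real file you would pass the open file object instead of the list, then reopen or rewind the file before the actual processing, since counting consumes the iterator.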

    Another very simple fix would be to specify a default value with next():

    line2, line3, line4 = [next(input_file, '') for line in xrange(3)]
    

    Now, when next() would fail, the default value '' is instead returned. So even if the file is exhausted, you still get some content back.
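The difference between the two forms of next() can be seen on a small in-memory iterator. The sample lines below are made up for illustration.

```python
lines = iter(["line1\n", "line2\n"])

first = next(lines)    # consumes "line1\n"
second = next(lines)   # consumes "line2\n"

# The iterator is now exhausted. With a default argument, next()
# returns that default instead of raising StopIteration.
third = next(lines, '')

# Without a default, the same call raises StopIteration.
try:
    next(lines)
    raised = False
except StopIteration:
    raised = True
```

Keep in mind that with the default in place, a trailing partial block is still processed, padded with empty strings, so the script may end up printing it.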

    A probably better solution, however, would be to fix the way you iterate over the file. You currently advance the same file iterator in two places: once in the outer for loop and three times in the list comprehension. This may look simple enough not to cause other problems, but it is better to advance the iterator in a single place, or to use only next() calls; mixing next() with a for loop over the same iterator is fragile.

    You could, for example, use the grouper recipe from the itertools documentation to cleanly iterate over the file in groups of four:

    with open(filename1, 'r') as input_file:
        for line1, line2, line3, line4 in grouper(input_file, 4, fillvalue=''):
            # do things with the lines
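Note that grouper is a recipe from the itertools documentation, not something you can import from itertools, so it has to be defined in your script. A minimal, self-contained sketch follows. The function name heterogeneous_blocks, the read_length parameter, and the decision to skip an incomplete trailing block are assumptions made for illustration; the original script does not say what should happen to a partial block.

```python
from collections import Counter

try:
    from itertools import izip_longest as zip_longest  # Python 2
except ImportError:
    from itertools import zip_longest  # Python 3


def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def heterogeneous_blocks(lines, read_length=36):
    """Yield 4-line blocks whose 2nd line is not one repeated character."""
    for line1, line2, line3, line4 in grouper(lines, 4, fillvalue=''):
        if not line4:
            # Incomplete trailing block: skipped here (an assumption).
            continue
        counts = Counter(line2.rstrip('\r\n'))
        if counts and max(counts.values()) < read_length:
            yield line1 + line2 + line3 + line4.rstrip()


# In-memory stand-in for the input file, modelled on the question's
# example: one mixed read, one homogeneous read, one partial block.
sample_lines = [
    "@1:1:1323:1032:Y\n",
    "AGCAGCATTGTACAGGGCTATCATGGAATTCTCGGG\n",
    "+1:1:1323:1032:Y\n",
    "HHHBHHBHBHGBGGGH8HHHGGGGFHBHHHHBHHHH\n",
    "@1:1:1610:1033:Y\n",
    "A" * 36 + "\n",
    "+1:1:1610:1033:Y\n",
    "HHEHHHHHHHHHHHBGGD>GGD@G8GGGGDHBHH4C\n",
    "@1:1:1679:1032:Y\n",
    "CGGTGGATCACTCGGCTCGTGCGTCGATGAAGAACG\n",
]
blocks = list(heterogeneous_blocks(sample_lines))
```

In the real script you would pass the open file object to heterogeneous_blocks and print each yielded block. Because the file iterator is advanced in only one place, the loop ends cleanly when the file does.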