python fastq

Iterating over a gzipped text file stops after for loop


I am trying to iterate over a gzipped text file. The data is in blocks of four lines. I need to take a percentage of these blocks and copy them to another file. My code reads each block and then decides whether to copy it using random.random(). My problem is that the code stops selecting after the first block is selected and appears to stop iterating over the gzip file. Does anyone have any ideas what I might be doing wrong?

Thanks! Chris


#Calc percentage of reads that should be sampled
per_reads = sub_reads_num/input_reads

#Read gzipped file and save selected lines to mem
output_list = []

input_f = gzip.open(input_path, 'rb')

counter = 0
buffer = []
for line in input_f:
    buffer.append(line)
    counter += 1
    if counter == 4:
        if random.random() < per_reads:
            for x in buffer:
                output_list.append(x)
        else:
            buffer = []
            counter = 0

input_f.close()

Solution

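  • After you gather a group of four lines and decide whether to save it, reset your counter and buffer in both cases. In your code the reset sits in the else branch, so once a block is selected the counter keeps growing past 4 and the test counter == 4 never succeeds again.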

    for line in input_f:
        buffer.append(line)
        counter += 1
        if counter == 4:
            if random.random() < per_reads:
                for x in buffer:
                    output_list.append(x)
            buffer = []
            counter = 0
    

    Refactored to use enumerate and list.extend:

    for line_no, line in enumerate(input_f, 1):
        buffer.append(line)
        if line_no % 4 == 0:
            if random.random() < per_reads:
                output_list.extend(buffer)
            buffer = []
    

    This test, with a stand-in list instead of the file and a selection rate of about 50%, works:

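    import random

    output_list = []
    buffer = []
    # Stand-in for the file: each character plays the role of one line
    input_f = list('abcdefghejklmnopqrstuvwxyz')
    for line_no, line in enumerate(input_f, 1):
        buffer.append(line)
        print(line_no)
        if line_no % 4 == 0:
            # Stand-in for: if random.random() < per_reads:
            if random.choice((0,1,2,3)) < 2:  # keeps a block about half the time
                print(buffer)
                output_list.extend(buffer)
            buffer = []  # reset whether or not the block was kept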
    

    Result of one run (the selection is random, so the blocks kept will vary):

    >>>
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    ['e', 'j', 'k', 'l']
    13
    14
    15
    16
    17
    18
    19
    20
    ['q', 'r', 's', 't']
    21
    22
    23
    24
    25
    26
    >>> output_list
    ['e', 'j', 'k', 'l', 'q', 'r', 's', 't']
    >>> 
    

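    If the loop still selects fewer blocks than you expect after this change, check the conditional: print per_reads and confirm it is the fraction you intended. Under Python 2, sub_reads_num/input_reads is integer division when both values are integers, which gives 0, and then no block is ever selected.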
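
For completeness, here is a minimal end-to-end sketch of the approach, not a definitive implementation. The function names sample_blocks and subsample_gzip, the output_path and seed parameters, and the choice to write the result as another gzip file are all assumptions for illustration. The question only says the blocks should be copied to another file.

```python
import gzip
import random


def sample_blocks(lines, fraction, block_size=4, rng=random):
    """Return the lines of every complete block kept with probability `fraction`."""
    kept = []
    block = []
    for line in lines:
        block.append(line)
        if len(block) == block_size:
            if rng.random() < fraction:
                kept.extend(block)
            block = []  # reset whether or not the block was kept
    return kept  # a trailing incomplete block is dropped


def subsample_gzip(input_path, output_path, sub_reads_num, input_reads, seed=None):
    """Hypothetical wrapper: sample blocks from one gzip file into another."""
    # float() avoids Python 2 integer division, which would make the fraction 0
    fraction = float(sub_reads_num) / input_reads
    rng = random.Random(seed)  # a seed makes a run repeatable
    with gzip.open(input_path, "rb") as src:
        kept = sample_blocks(src, fraction, rng=rng)
    with gzip.open(output_path, "wb") as dst:
        for line in kept:
            dst.write(line)
    return len(kept) // 4  # number of four-line blocks written
```

Like the original code, this holds the selected lines in memory before writing them. Because each block is kept independently, the number of blocks written will be close to sub_reads_num but usually not exactly equal to it.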