Biopython -- reading a fixed number of seq_records at a time


I built some code that retrieves PHRED scores from a fastq file, puts them all into a single list, and then passes the list on to another function. It looks like so:

from Bio import SeqIO
import itertools

def PHRED_get():
    all_scores = []
    # fastq_location is the path to the FASTQ file, defined elsewhere
    print("Building PHRED score bins...")
    for seq_record in SeqIO.parse(fastq_location, "fastq"):
        # Per-record list of PHRED quality scores
        temp_scores = seq_record.letter_annotations['phred_quality']
        all_scores.append(temp_scores)
    # Flatten the per-record lists into one long list of scores
    all_scores = list(itertools.chain(*all_scores))
    score_bin_maker(all_scores)

The problem is that this loop keeps going until every seq_record has been read and its PHRED scores retrieved. To be easier on RAM, I'd like code that reads a smaller number of seq_records at a time (say, 100) and appends their quality scores to my ongoing uberlist, then grabs the next 100 seq_records and loops again. I'm having trouble working out how to do this. Any ideas?


Solution

  • Simple: Keep a counter and break out of the loop once it reaches 100. Some other early-halt condition, such as if len(temp_scores) > 1000: break, would also work (a minimal counter-based sketch appears after the code below).

    Elegant: Use itertools.islice to take just the first 100 records from the iterator:

    import itertools

    from Bio import SeqIO

    def PHRED_get():
        all_scores = []
        # fastq_location is the path to the FASTQ file, defined elsewhere
        print("Building PHRED score bins...")
        # islice stops the parser after the first 100 records
        for seq_record in itertools.islice(SeqIO.parse(fastq_location, "fastq"), 100):
            temp_scores = seq_record.letter_annotations['phred_quality']
            all_scores.append(temp_scores)
        # Flatten the per-record lists into one long list of scores
        all_scores = list(itertools.chain(*all_scores))
        score_bin_maker(all_scores)
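
    For the simple counter option above, a minimal sketch might look like the following (PHRED_get_first_100 is just an illustrative name; fastq_location and score_bin_maker are assumed to exist as in the question):

    from Bio import SeqIO

    def PHRED_get_first_100(fastq_location):
        # Illustrative variant of the question's function: stop after 100 records
        all_scores = []
        for count, seq_record in enumerate(SeqIO.parse(fastq_location, "fastq")):
            if count >= 100:
                break  # early halt once 100 records have been processed
            all_scores.extend(seq_record.letter_annotations['phred_quality'])
        score_bin_maker(all_scores)  # score_bin_maker as defined by the question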
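
    Note that islice as used above only reads the first 100 records and then stops. If the goal is to work through the whole file 100 records at a time, as the question describes, islice can be applied repeatedly to the same parser iterator, because each call consumes the records it returns. Below is a sketch under that assumption; phred_batches and batch_size are illustrative names, and score_bin_maker is the question's own function:

    import itertools

    from Bio import SeqIO

    def phred_batches(fastq_location, batch_size=100):
        # Yield lists of PHRED scores, batch_size records at a time (sketch only)
        records = SeqIO.parse(fastq_location, "fastq")
        while True:
            # Each islice call pulls the *next* batch_size records off the shared iterator
            batch = list(itertools.islice(records, batch_size))
            if not batch:
                break
            yield [score
                   for rec in batch
                   for score in rec.letter_annotations['phred_quality']]

    all_scores = []
    for scores in phred_batches(fastq_location, 100):
        all_scores.extend(scores)  # or process each batch here to keep memory use flat
    score_bin_maker(all_scores)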