I built some code that retrieves PHRED scores from a fastq file, puts them all into a single list, and then passes the list on to another function. It looks like so:
    from Bio import SeqIO
    import itertools

    def PHRED_get(fastq_location):
        all_scores = []
        print("Building PHRED score bins...")
        for seq_record in SeqIO.parse(fastq_location, "fastq"):
            # per-record PHRED scores live in letter_annotations
            temp_scores = seq_record.letter_annotations["phred_quality"]
            all_scores.append(temp_scores)
        # flatten the list of per-record score lists into one list
        all_scores = list(itertools.chain(*all_scores))
        score_bin_maker(all_scores)
The problem is that this loop will continue until all seq_records have been searched and corresponding PHRED scores retrieved. In order to be more RAM conservative, I'd like to have some code that reads a smaller number of seq_records at a time (say, 100), and then pops their respective quality scores onto my ongoing uberlist. It would then go grab info from the next 100 seq_records and do the loop again. I'm having trouble understanding how to get this done. Any ideas?
Simple: keep a counter and break from the loop when it reaches 100. Some other early-halt condition, such as if len(temp_scores) > 1000: break, would work too.
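A minimal sketch of that counter approach, assuming the records come from the same SeqIO.parse iterator as in the question (take_first_n is a hypothetical helper name):

```python
from itertools import chain

def take_first_n(records, n=100):
    """Collect PHRED scores from at most n records, then stop early."""
    all_scores = []
    for count, seq_record in enumerate(records):
        if count >= n:
            break  # early halt once n records have been read
        all_scores.append(seq_record.letter_annotations["phred_quality"])
    # flatten the per-record lists into one list of scores
    return list(chain(*all_scores))
```

Because the loop stops after n records, only those records' scores are ever held in memory.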
Elegant: use itertools.islice to take just the first 100 records from the iterator:
    import itertools
    from Bio import SeqIO

    def PHRED_get(fastq_location):
        all_scores = []
        print("Building PHRED score bins...")
        # islice stops the iteration after the first 100 records
        for seq_record in itertools.islice(SeqIO.parse(fastq_location, "fastq"), 100):
            temp_scores = seq_record.letter_annotations["phred_quality"]
            all_scores.append(temp_scores)
        all_scores = list(itertools.chain(*all_scores))
        score_bin_maker(all_scores)
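To work through the whole file 100 records at a time (rather than stopping after the first 100), the same islice call can be repeated against one shared iterator until it is exhausted. A sketch, assuming records is the iterator returned by SeqIO.parse (PHRED_get_batched is a hypothetical name):

```python
import itertools

def PHRED_get_batched(records, batch_size=100):
    """Yield one flattened list of PHRED scores per batch of records."""
    while True:
        # islice consumes the next batch_size records from the shared iterator;
        # each call picks up where the previous one left off
        batch = list(itertools.islice(records, batch_size))
        if not batch:
            break  # iterator exhausted, no records left
        scores = [rec.letter_annotations["phred_quality"] for rec in batch]
        yield list(itertools.chain(*scores))
```

Each yielded list can be handed to score_bin_maker, or extended onto the ongoing uberlist, before the next batch is read, so only one batch of records is in memory at a time.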