I am working on a large fasta file that I want to split into multiple files according to the gene ID. I am trying to use the following script from the Biopython tutorial:
def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)  # Python 3; was iterator.next() in Python 2
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch
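One quick way to convince yourself the generator itself is behaving is to run it on a plain iterator with no Biopython involved. This is a minimal sanity-check sketch, assuming the Python 3 `next(iterator)` form of the function:

```python
def batch_iterator(iterator, batch_size):
    """Yield lists of up to batch_size entries from iterator."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                break
            batch.append(entry)
        if batch:
            yield batch

# Ten items in batches of four: two full batches plus a shorter final one.
batches = list(batch_iterator(iter(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

If the batch count here matches what you expect, any surplus output files come from the input, not the splitter.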
from Bio import SeqIO

record_iter = SeqIO.parse(open('/path/sorted_sequences.fa'), 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 93)):
    filename = 'gene_%i.fasta' % (i + 1)
    with open('/path/files/' + filename, 'w') as output_handle:
        count = SeqIO.write(batch, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))
It does split the file into chunks of 93 sequences, but it gives 2 files per group of 93. I cannot see the error, but I guess there is one. Is there another way I could split the large fasta file? Thanks
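For what it's worth, the batching step can also be written with `itertools.islice` from the standard library, which avoids the sentinel-value bookkeeping. This is only a sketch of an alternative, not the tutorial's code; the commented Biopython usage assumes the same file paths as above:

```python
from itertools import islice

def batched(iterator, batch_size):
    """Yield lists of up to batch_size items drawn from iterator."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Used with Biopython in the same way as batch_iterator, e.g.:
# from Bio import SeqIO
# record_iter = SeqIO.parse('/path/sorted_sequences.fa', 'fasta')
# for i, batch in enumerate(batched(record_iter, 93)):
#     with open('gene_%i.fasta' % (i + 1), 'w') as handle:
#         SeqIO.write(batch, handle, 'fasta')

print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```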
In case anyone is interested in this script in the future: the script works perfectly as it is. The problem was that the file I was trying to divide had more sequences than it should. So I deleted the bad file and produced a new one, which split nicely with the above script.