
A biopython script to split a large fasta file into multiple ones


I am working with a large fasta file that I want to split into multiple files according to the gene id. I am trying to use the following script from the Biopython tutorial:

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)  # Python 3; iterator.next() is Python 2 only
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch 

from Bio import SeqIO

record_iter = SeqIO.parse('/path/sorted_sequences.fa', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 93)):
    filename = 'gene_%i.fasta' % (i + 1)
    with open('/path/files/' + filename, 'w') as output_handle:
        count = SeqIO.write(batch, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))

It does split the file into batches of 93 sequences, but it produces 2 files per group of 93. I cannot see the error, but I guess there is one. Is there another way I could split the large fasta file? Thanks
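As an aside, a shorter alternative to the hand-written `batch_iterator` is to batch with `itertools.islice` from the standard library. The sketch below is not from the original post: `batched` and `split_fasta` are hypothetical names, and the driver mirrors the loop in the question.

```python
from itertools import islice

def batched(records, size):
    """Yield lists of up to `size` items from any iterator;
    the final list may be shorter (same contract as batch_iterator)."""
    records = iter(records)
    while chunk := list(islice(records, size)):
        yield chunk

def split_fasta(in_path, out_dir, size=93):
    """Hypothetical driver mirroring the question's loop."""
    from Bio import SeqIO  # deferred so batched() works without Biopython
    for i, batch in enumerate(batched(SeqIO.parse(in_path, 'fasta')), start=1):
        filename = 'gene_%i.fasta' % i
        with open(out_dir + '/' + filename, 'w') as handle:
            SeqIO.write(batch, handle, 'fasta')
```

`batched` works on any iterator, so it can be tested on plain lists without touching Biopython at all.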


Solution

  • In case anyone is interested in this script in the future: the script works perfectly as it is. The problem was that the file I was trying to split had more sequences than it should. So I deleted the bad file and produced a new one, which split nicely with the above script.
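Since the root cause was an input file with more sequences than expected, a quick sanity check before splitting can catch this early. The helper below is an assumption of mine (not part of the original post); it simply counts FASTA header lines.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines.
    Hypothetical helper for sanity-checking input before splitting."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith('>'))
```

If `count_fasta_records('/path/sorted_sequences.fa')` is not the number you expect, the input file is the problem, not the splitting script.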