Search code examples
pythonbioinformaticsbiopythonfasta

How to split a multi-fasta file into chunks of equal sequence length AND change the headers using biopython


First, I apologize for my pythonic ignorance. I need to break up my multi-sequence fasta file into equal sized chunks for a pipeline downstream. I haven't run across anything that does this easily or in the format I'm looking for.

An example fasta file input:

original.fas

>contig1

ACGTA

>contig2

GGGATAGTCA

>contig3

GACTACTTTT

The above example fasta has 25bp. If I set the "chunk number" parameter to '4', then I would expect my output files to all have 7 base pairs, except for the last file which would have the remainder 4bp. It would look like this:

chunk1.fas

>contig1:0-4

ACGTA

>contig2:0-1

GG

chunk2.fas

>contig2:2-7

GATAGTC

chunk3.fas

>contig2:9-9

A

>contig3:0-5

GACTAC

chunk4.fas

>contig3:6-9

TTTT

Notice each resulting chunk*.fas includes 7 base pairs, except for the leftover base pairs in chunk4.fas. Also, each resulting sequence header in the chunk files have changed from the original, so that they include a ":" as well as a "start" and "stop" location derived from the original sequence.

The biopython cookbook has a pretty nice batch size iterator tool and I think my answer resides within manipulating this code, but I haven't a clue how to go about this.

Any help is appreciated. Cheers.


def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.
    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.
    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = False
            if not entry:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

record_iter = SeqIO.parse('aVan.fa', 'fasta')

for i, batch in enumerate(batch_iterator(record_iter, 1000), start=1):
    filename = 'group_{}.fasta'.format(i)
    count = SeqIO.write(batch, filename, 'fasta')
    print('Wrote {} records to {}'.format(count, filename))

Solution

  • That is not an easy task, but have a look at this implementation:

    from Bio import SeqIO
    from Bio.SeqRecord import SeqRecord
    
    chunk_number = 4
    records = list(SeqIO.parse("input.fasta", "fasta"))
    chunk_size = sum(len(r) for r in records) // chunk_number + 1
    
    
    def create_batch(records, chunk_size):
        record_it = iter(records)
    
        record = next(record_it)
        current_base = 0
    
        batch = []
        batch_size = 0
    
        # While there are new records, keep creating new batches.
        while record:
            # Loop over records untill the batch is full. (or no new records)
            while batch_size != chunk_size and record:
    
                end = current_base + chunk_size - batch_size
                seq = record[current_base:end]
    
                end_of_slice = current_base + len(seq) - 1
                fasta_header = record.id + ":{}-{}".format(current_base, end_of_slice)
    
                seq.id = seq.name = fasta_header
                seq.description = ''
                batch.append(seq)
    
                current_base += len(seq)
                batch_size += len(seq)
    
                # Current record is exhausted, get a new one.
                if current_base >= len(record):
                    record = next(record_it, None)
                    current_base = 0
    
            # We have a batch with the correct size (or no new bathces)
            yield batch
            batch = []
            batch_size = 0
    
    
    for i, batch in enumerate(create_batch(records, chunk_size)):
        filename = "chunk{}.fasta".format(i)
        SeqIO.write(batch, filename, "fasta")