
Splitting a multisequence fasta protein file into many files using Biopython


def batch_iterator(iterator, batch_size) :
    entry = True
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.__next__
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch



from Bio import SeqIO

record_iter = SeqIO.parse(open("C:\\Users\\IDEAPAD\\Desktop\\fypsplit\\protein.fasta"), "fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1000)):
    filename = "group_%i.fasta" % (i + 1)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, "fasta")
    print("Wrote %i records to %s" % (count, filename))

I am trying to split a multi-sequence FASTA file into several smaller files using Biopython (about 7 files in this example), but I get the error AttributeError: 'function' object has no attribute 'id'.

Can someone help me? Thank you in advance.


Solution

  • The AttributeError is raised on this line:

    count = SeqIO.write(batch, handle, "fasta")
    

    because SeqIO.write expects an iterable (for example a list) of SeqRecord objects. Your batch_iterator, however, produces a list of method objects, which have no id attribute.

    Why methods? Because the call parentheses are missing here:

    entry = iterator.__next__
    

    should be

    entry = iterator.__next__()
    

    With this change the code runs without error. The built-in next(iterator) is the more idiomatic way to write the same call.

    For a test file containing 11 sequences, and with the batch size reduced from 1000 to 4 for testing, I got the following output:

    Wrote 4 records to group_1.fasta
    Wrote 4 records to group_2.fasta
    Wrote 3 records to group_3.fasta
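
As a side note, the batching logic above can also be written without calling __next__ by hand, which avoids this kind of mistake altogether. The following is only a minimal sketch, not the original poster's code: it uses itertools.islice from the standard library, and plain integers stand in for the SeqRecord objects so that it runs without Biopython or an input file.

```python
from itertools import islice


def batch_iterator(iterable, batch_size):
    """Yield successive lists of up to batch_size items from iterable."""
    iterator = iter(iterable)
    while True:
        # islice pulls at most batch_size items and stops quietly at the end
        batch = list(islice(iterator, batch_size))
        if not batch:
            # Nothing left to read: end the generator
            return
        yield batch


# Stand-in data (an assumption for illustration): 11 integers instead of
# the SeqRecord objects that SeqIO.parse would produce.
sizes = [len(batch) for batch in batch_iterator(range(11), 4)]
print(sizes)  # [4, 4, 3]
```

Because SeqIO.parse returns an iterator, it should be possible to pass record_iter to this function in place of range(11), with the rest of the writing loop unchanged.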