Tags: python, list, bioinformatics, biopython, genbank

Iterating through a series of GenBank genes and appending each gene's features to a list returns only the last gene


I'm having a problem with my code. I'm trying to iterate through the list of genes in a GenBank file using Biopython. Here's what it looks like:

from Bio import SeqIO

class genBank:
    gbProtId = str()
    gbStart = int()
    gbStop = int()
    gbStrand = int()

genBankEntries = list()

for seq_record in SeqIO.parse(genBankFile, "genbank"):
    for seq_feature in seq_record.features:
        genBankEntry = genBank
        if seq_feature.type == "CDS":
            genBankEntry.gbProtId = seq_feature.qualifiers['protein_id']
            genBankEntry.gbStart = seq_feature.location.start # prodigal GFF3 output is 1 based indexing
            genBankEntry.gbStop = seq_feature.location.end 
            genBankEntry.gbStrand = seq_feature.strand
            genBankEntries.append(genBankEntry)

It looks like it should work, but when I run it, the resulting genBankEntries list has one element per gene in the GenBank file, yet every element holds only the values from the final feature in seq_record.features:

00 = {type} <class '__main__.genBank'>
 gbProtId = {list} ['BAA31840.1']
 gbStart = {ExactPosition} 90649
 gbStop = {ExactPosition} 91648
 gbStrand = {int} 1
...
82 = {type} <class '__main__.genBank'>
 gbProtId = {list} ['BAA31840.1']
 gbStart = {ExactPosition} 90649
 gbStop = {ExactPosition} 91648
 gbStrand = {int} 1

This is especially confusing because both for-loops seem to work correctly:

for seq_record in SeqIO.parse(genBankFile, "genbank"):
    for seq_feature in seq_record.features:
        print(seq_feature)

Why is this?


Solution

  • You never create an instance of the genBank class. The assignment genBankEntry = genBank binds the name to the class itself, so each loop iteration sets class-level attributes on genBank and appends that same class object to the list. Each pass through the loop overwrites the values from the previous pass, which is why every list element shows the last gene.

    In the first line of your inner loop, add parentheses to call the class and create an instance: genBankEntry = genBank(). This creates a new, distinct object on each pass through the loop.
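
    The difference can be reproduced without Biopython. The sketch below is only an illustration: the tuples in cds_features and the names Shared and GenBankEntry are hypothetical stand-ins for the values read from each CDS feature, not part of the original code. It appends the class itself in one loop and fresh instances in the other.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for (protein_id, start, end, strand) of each CDS feature.
cds_features = [
    ("PROT_A", 10, 100, 1),
    ("PROT_B", 200, 350, -1),
    ("PROT_C", 400, 520, 1),
]


class Shared:
    """Mimics the original genBank class: attributes live on the class."""
    prot_id = ""


@dataclass
class GenBankEntry:
    """One instance per gene; attributes live on each instance."""
    prot_id: str
    start: int
    stop: int
    strand: int


# Buggy pattern: no parentheses, so 'entry' is the class object every time.
buggy = []
for prot_id, start, stop, strand in cds_features:
    entry = Shared
    entry.prot_id = prot_id  # overwrites the single class-level attribute
    buggy.append(entry)      # appends the same object again

# Fixed pattern: calling the class creates a new object on each pass.
fixed = [GenBankEntry(*feature) for feature in cds_features]

print([e.prot_id for e in buggy])  # ['PROT_C', 'PROT_C', 'PROT_C']
print([e.prot_id for e in fixed])  # ['PROT_A', 'PROT_B', 'PROT_C']
print(buggy[0] is buggy[1])        # True: one object, listed three times
```

    A dataclass is used here only to keep the sketch short; a plain class with an __init__ method that sets the attributes on self would behave the same way.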