Search code examples
pythonbiopythongenbank

Incomplete parsing of entire genbank file using python/biopython


The main goal of my script is to convert a genbank file to a gtf file. My problem pertains to extracting CDS information (gene, position (e.g., CDS 2598105..2598404), codon_start, protein_id, db_xref) from all CDS entries. My script should open/parse a genbank file, extract information from each CDS entry, and write the information to another file. The script produces no errors, but only writes information from the first 1/2 of the genbank file before terminating. Here is my code...

import Bio
from Bio import GenBank
from Bio import SeqIO

fileList = ['data_files/e_coli_ref_BA000007.2.gb']
qualies = ['gene', 'protein_id', 'db_xref']

#######################################################DEFINITIONS################################################################
def strip_it(string_name):
    stripers = ['[', ']', '\'', '"']
    for s in stripers:
        string_name = string_name.replace(s, '')
        string_name = string_name.lstrip()
    return string_name

def strip_it_attributes(string_name):
    stripers = ['[', ']', '\'', '"', '{', '}',',']
    for s in stripers:
        string_name = string_name.replace(s, '') 
        string_name = string_name.lstrip() 
        string_name = string_name.replace(': ', '=')
        string_name = string_name.replace(' ', ';')
    return string_name
#---------------------------------------------------------------------------------------------------------------------------------

#######################################################################################################################
for f in fileList:
    nameOut = f.replace('gb', 'gtf')

    with open(f, 'r') as inputFile:
        with open(nameOut, 'w') as outputFile:
            record = next(SeqIO.parse(f, 'genbank'))
            seqid = record.id
            typeName = 'Gene'
            source = 'convert_gbToGFT.py'
            start_codon = 'NA'
            attribute = 'NA'    

            featureCount = 0
            for f in record.features:
                print(f.type)
                string = ''
                if f.type == 'CDS':
                    dic = {}
                    CDS = record.features[featureCount]

                    position = strip_it(str(CDS.location))
                    start = position.split(':')[0]
                    stop = position.split(':')[1].split('(')[0]
                    strand = position.split(':')[1].split('(')[1].replace(')', '')
                    score = '.'

                    for q in qualies:
                        if q in CDS.qualifiers:
                            if q not in dic:
                                dic[q] = ''
                            dic[q] = strip_it(str(CDS.qualifiers[q]))

                    attribute = strip_it_attributes(str(dic))

                    if 'codon_start' in CDS.qualifiers:
                        start_codon = str(int(str(CDS.qualifiers['codon_start'][0]))-1) #need string when finished so it can be added to variable 'string'

                    string = '\t'.join([seqid, source, typeName, start, stop, score, strand, start_codon, attribute])
                    if attribute.count(';') == 2:
                        outputFile.write(string + '\n')

                    featureCount+=1

#---------------------------------------------------------------------------------------------------------------------------------

The last line of the output file is:

BA000007.2  convert_gbToGFT.py   Gene  2598104  2598404  .  +  0  protein_i     d=BAB36052.1;db_xref=GI:13362097;gene=ECs2629

The location of gene ECs2629 appears on line 36094 in the genbank file, but the total number of lines in this file is 73498. I have re-downloaded the file multiple times to see if there was a downloading issue and I have visually inspected the file (I find no fault with it). I have also tried this script on another equally large genbank file and was met with identical issues.

Can anyone offer some suggestions as to why the entire genbank file is not parsed, how I could modify my code to remove this issue, or point me to another possible solution?

(you can see the format of a genbank file from here: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), however, I am working with an E. coli genbank file (Escherichia coli O157:H7 str. Sakai DNA, complete genome) which can be found here: http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2

I am using the following: Centos 6.7, Python 3.4.3 :: Anaconda 2.3.0 (64-bit), Biopython 1.66

[EDIT] @Gerrat suggestions worked for the file in question, but not for other files. Using http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3 with the suggested edit yields ~28 lines of output where my original code output 2084 lines (however, there should be 4332 lines of output).


Solution

  • Thank you @Gerrat for your comments. I re-worked the script and it works swimmingly.

    import Bio 
    from Bio import GenBank
    from Bio import SeqIO
    
    fileList = ['F1.gb', 'F2.gb']
    
    for f in fileList:
          with open(f, 'rU') as handle:
             for record in SeqIO.parse(handle, 'genbank'):
                for feature in record.features:
                   if feature.type=='CDS':
                      #[extract feature values here]
                      count+=1
    
       print('You parsed', count, 'CDS features')