Search code examples
pythonbiopythongenbank

Modify location of a genbank feature


Edit : I know feature.type will give gene/CDS and feature.qualifiers will then give "db_xref"/"locus_tag"/"inference" etc. Is there a feature. object which will allow me to access the location (eg: [5240:7267](+) ) directly?

This URL give a bit more info, though I can't figure out how to use it for my purpose... http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html#location_operator

Original Post:

I am trying to modify the location of features within a GenBank file. Essentially, I want to modify the following bit of a GenBank file:

 gene            5240..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5240..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 ...........................

to

 gene            5357..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5357..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 .............................

Note the changes from 5240 to 5357

So far, from scouring the internet and Stackoverflow, I have:

from Bio import SeqIO
gb_file = "mtbtomod.gb"
gb_record = SeqIO.parse(open(gb_file, "r+"), "genbank")
rvnumber = 'Rv0005'
newstart = 5357

final_features = []

for record in gb_record:
  for feature in record.features:
    if feature.type == "gene":
        if feature.qualifiers["locus_tag"][0] == rvnumber:
            if feature.location.strand == 1:
                feature.qualifiers["amend_position"] = "%s:%s" % (newstart, feature.location.end+1)
            else:
                # do the reverse for the complementary strand
    final_features.append(feature)
  record.features = final_features
  with open("testest.gb","w") as testest:
    SeqIO.write(record, testest, "genbank")

This basically creates a new qualifier called "amend_position".. however, what I would like to do is modify the location directly (with or without creating a new file...)

Rv0005 is just an example of a locus_tag I need to update. I have about 600 more locations to update, which explains the need for a script.. Help!


Solution

  • Ok, I have something which now fully works. I'll post the code in case anyone ever needs something similar

    __author__ = 'Kavin'
    
    from Bio import SeqIO
    from Bio import SeqFeature
    import xlrd
    import sys
    import re
    
    workbook = xlrd.open_workbook(sys.argv[2])
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
    
    # Create dicts to store TSS data
    TSS = {}
    row = {}
    # For each entry (row), store the startcodon and strand information
    for i in range(2, sheet.nrows - 1):
        if data[i][5] < -0.7:   # Ensures BASS score is within significant range
            Gene = data[i][0]
            row['Direction'] = str(data[i][3])
            row['StartCodon'] = int(data[i][4])
            TSS[str(Gene)] = row
            row = {}
        else:
            i += 1
    
    # Create an output filename based on input filename
    outfile_init = re.search('(.*)\.(\w*)', sys.argv[1])
    outfile = str(outfile_init.group(1)) + '_modified.' + str(outfile_init.group(2))
    
    final_features = []
    for record in SeqIO.parse(open(sys.argv[1], "r"), "genbank"):
        for feature in record.features:
            if feature.type == "gene" or feature.type == "CDS":
                if TSS.has_key(feature.qualifiers["locus_tag"][0]):
                    newstart = TSS[feature.qualifiers["locus_tag"][0]]['StartCodon']
                    if feature.location.strand == 1:
                        feature.location = SeqFeature.FeatureLocation(SeqFeature.ExactPosition(newstart - 1),
                                                                      SeqFeature.ExactPosition(
                                                                          feature.location.end.position),
                                                                      feature.location.strand)
                    else:
                        feature.location = SeqFeature.FeatureLocation(
                            SeqFeature.ExactPosition(feature.location.start.position),
                            SeqFeature.ExactPosition(newstart), feature.location.strand)
            final_features.append(feature)  # Append final features
        record.features = final_features
        with open(outfile, "w") as new_gb:
            SeqIO.write(record, new_gb, "genbank")
    

    This assumes usage such as python program.py <genbankfile> <excel spreadsheet>

    This also assumes a spreadsheet of the following format:

    Gene Synonym Tuberculist_annotated_start Orientation Re-annotated_start BASS_score

    Rv0005 gyrB 5240 + 5357 -1.782

    Rv0012 Rv0012 14089 + 14134 -1.553

    Rv0018c pstP 23181 - 23172 -2.077

    Rv0032 bioF2 34295 + 34307 -0.842

    Rv0037c Rv0037c 41202 - 41163 -0.554