Search code examples
pythonfastancbigenbank

Get Accession Numbers from NCBI from corresponding GI Numbers in fasta headers in python


I keep seeing warnings on Genbank that they are phasing out GI numbers and have a number of fasta files saved where I've edited the headers in the following format:

>SomeText_ginumber

I've no idea where to even begin with this but is there a way, ideally with python, that I could get the corresponding accession numbers for each gi from NCBI and output a file with the headers as follows:

>SomeText_accessionnumber

Here's another example of the file format:

>Desulfovibrio_fructosivorans_ferredoxin_492837709
MTKSVAPTVTIDGKVVPIEGERNLLELIRKVRIDLPTFCYHSELSVYGACRLCLVEVKNRGIMGAC
>Oxalobacteraceae_bacterium_AB_14_hypothetical_protein_522195384
MIKLTVNGIPVEVDEGATYLDAANKAGVHIPTLCYHPRFRSHAVCRMCLVHVAGSSRPQAACIGKA

Edit/Update:

from Bio import Entrez
from time import sleep
import sys
import re
Entrez.email = ''

seqs = open(sys.argv[1],"r")

for line in seqs:
    if line.startswith('>'):
        gi = re.findall("\d{5,}",line)
        matches = len(gi)
        #print(matches)
        if matches < 2:
            if gi:
                handle = Entrez.efetch(db="nucleotide", id=gi, retmode="xml")
                records = Entrez.read(handle)
                print(line[0:line.rfind('_') + 1] + records[0]['GBSeq_primary-accession'])
                sleep(1)
        elif matches >= 2:
             print("Error - More than one ginumber in header!")
             break
        else:
            seq=line.rstrip()
            print(seq)
    else:
        seq1=line.rstrip()
        print(seq1)

Solution

  • Try using BioPython.

    The following snippet should get you started. First get the GI from the header (the part of the header after the underscore), get the data from GenBank, print the old header but with the accession number and then the rest of your input sequences, done :)

    This works for your two examples but will probably fail with more data (missing GI, etc.). Also the accession numbers have underscores, just like your header, will will complicate parsing later. Perhaps replace the underscore with something else or add another separator.

    from Bio import Entrez
    from time import sleep
    Entrez.email = 'your@email'
    
    seqs = """>Desulfovibrio_fructosivorans_ferredoxin_492837709
    MTKSVAPTVTIDGKVVPIEGERNLLELIRKVRIDLPTFCYHSELSVYGACRLCLVEVKNRGIMGAC
    >Oxalobacteraceae_bacterium_AB_14_hypothetical_protein_522195384
    MIKLTVNGIPVEVDEGATYLDAANKAGVHIPTLCYHPRFRSHAVCRMCLVHVAGSSRPQAACIGKA"""
    
    
    for line in seqs.splitlines():
        if line.startswith('>'):
            gi = line[line.rfind('_') + 1:]
            handle = Entrez.efetch(db="nucleotide", id=gi, retmode="xml")
            records = Entrez.read(handle)
            print(line[0:line.rfind('_') + 1] + records[0]['GBSeq_primary-accession'])
            sleep(1)
        else:
            print(line)