python, biopython, ncbi

Getting a gene sequence from Entrez using Biopython


This is what I want to do. I have a list of gene names, for example ['ITGB1', 'RELA', 'NFKBIA'], and I want to get the sequence for each of them.

Going through the Biopython help and the tutorial for the Entrez API, I came up with this:

x = ['ITGB1', 'RELA', 'NFKBIA']
for item in x:
    handle = Entrez.efetch(db="nucleotide", id=item ,rettype="gb")
    record = handle.read()
    out_handle = open('genes/'+item+'.xml', 'w') #to create a file with gene name
    out_handle.write(record)
    out_handle.close()

But this keeps erroring out. I have discovered that it works if the id is a numerical id (although you have to pass it as a string, e.g. '186972394'), like so:

handle = Entrez.efetch(db="nucleotide", id='186972394' ,rettype="gb")

This gets me the info I want which includes the sequence.

So now to the question: how can I search by gene name (because I do not have id numbers), or easily convert my gene names to ids, so that I can get the sequences for the gene list I have?

Thank you,


Solution

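  • First, build a search string from the gene name, e.g. ATK1:

    from Bio import Entrez
    Entrez.email = 'you@example.com'  # placeholder; NCBI asks for a contact address
    item = 'ATK1'
    animal = 'Homo sapiens'
    search_string = item+"[Gene] AND "+animal+"[Organism] AND mRNA[Filter] AND RefSeq[Filter]"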
    

    Now we have a search string that we can use to search for ids:

    handle = Entrez.esearch(db="nucleotide", term=search_string)
    record = Entrez.read(handle)  # parse the esearch XML result
    ids = record['IdList']
    

    This returns the ids as a list; if no id is found, the list is empty ([]). Now let's assume it returns exactly one item.

    seq_id = ids[0]  # you must add an if to handle the 0 and >1 cases
    handle = Entrez.efetch(db="nucleotide", id=seq_id, rettype="fasta", retmode="text")
    record = handle.read()
    

    This gives you a FASTA string, which you can save to a file:

    with open('myfasta.fasta', 'w') as out_handle:  # closes the file when done
        out_handle.write(record.rstrip('\n'))
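The steps above can be put together for the whole gene list from the question. This is only a minimal sketch: the helper names (build_search_term, pick_single_id, fetch_fasta, save_fasta_for_genes), the placeholder email address, and the choice to skip any gene that does not return exactly one id are all assumptions, not part of the original answer. A gene with several transcript variants may return more than one id, so you may prefer a different rule for picking one.

```python
import os


def build_search_term(gene, organism="Homo sapiens"):
    """Build the same Entrez query string as in the answer above."""
    return (f"{gene}[Gene] AND {organism}[Organism] "
            "AND mRNA[Filter] AND RefSeq[Filter]")


def pick_single_id(ids):
    """Return the id if the search found exactly one, otherwise None.

    Assumption: 0 hits or several hits are both treated as "skip this gene".
    """
    return ids[0] if len(ids) == 1 else None


def fetch_fasta(gene, organism="Homo sapiens"):
    """Search for the gene, then fetch its FASTA text (None if not unique)."""
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder: use your own address

    handle = Entrez.esearch(db="nucleotide",
                            term=build_search_term(gene, organism))
    ids = Entrez.read(handle)["IdList"]
    handle.close()

    seq_id = pick_single_id(ids)
    if seq_id is None:
        return None

    handle = Entrez.efetch(db="nucleotide", id=seq_id,
                           rettype="fasta", retmode="text")
    fasta = handle.read()
    handle.close()
    return fasta


def save_fasta_for_genes(genes, out_dir="."):
    """Write one <gene>.fasta file per gene that has a unique hit."""
    for gene in genes:
        fasta = fetch_fasta(gene)
        if fasta is None:
            print(f"{gene}: no unique hit, skipping")
            continue
        with open(os.path.join(out_dir, f"{gene}.fasta"), "w") as out_handle:
            out_handle.write(fasta)
```

Calling save_fasta_for_genes(['ITGB1', 'RELA', 'NFKBIA']) would then try each gene in turn. It needs network access, and NCBI asks that you do not send requests too quickly.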