
Translate a mixed FASTA file using Python/Biopython


I have a program that fetches a batch of sequences from a database and downloads them into a single FASTA file. The problem is that each sequence could be either protein or DNA. I'm splitting the large FASTA file into many small FASTA files, and in the end I need every sequence to be protein, so I want to test each one to see whether it already is.

If they're all protein, I'm fine, and if they're all DNA, I have an elegant way to translate them all. What I need is a way to test each new FASTA file and, if it is DNA, translate it and have the translation replace the DNA file.

Here's what I have so far:

from Bio import Entrez, SeqIO
from Bio.Seq import Seq

# output_file is the large FASTA file downloaded earlier.
record_iter = SeqIO.parse(output_file, 'fasta')
for seq_record in record_iter:
    # Write each record to its own FASTA file, named after the record id.
    outfile = '{0}.fa'.format(seq_record.id)
    count = SeqIO.write(seq_record, outfile, 'fasta')
    xmlfile = '{0}.xml'.format(seq_record.id)
    print(xmlfile)   # Added this to show it's working, not stalled.
    # if ...
    # And here is where I would somehow test each "outfile" to see if it's DNA or protein, and then do something different with each one.

I've tried converting the sequence to a string (I think), and I can't use alphabets, because that's not how the FASTA file is formatted, and I've tried a bunch of other things. Any help would be appreciated.

For those who aren't familiar, a FASTA file has the following format:

>here is a bunch of identification information about the sequence, after the '>' character.
GAAATTTGAGGCGTTCGCTGTGCAGTGAAAAGTGAGACTTTCTACTGTTCGCGTAGAAAGTGCAATAACC
AAGCCACCCACTCAGTGCCCAGACTAGCAACACAAGTCCGGCAAAATGGGAATCAAGTTCCTGGAAGTTA
TCAAACCGTTCTGCAGTATACTGCCGGAAATCGCAAAACCGGAGCGCAAGATCCAATTCAGGGAGAAAGT
GCTATGGACTGCGATCACCCTGTTCATCTTCCTGGTGTGCTGCCAGATCCCGCTTTTCGGTATCATGAGC
TCAGACTCGGCGGATCCCTTCTACTGGATCCGTGTGATCCTGGCCTCCAACCGTGGTACGCTCATGGAGC
TGGGTATCTCGCCCATCGTGACCTCTGGCCTCATTATGCAGCTGCTGGCCGGAGCA
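
To make the layout concrete, here is a minimal sketch of a reader for the format shown above, written in plain Python with no library. The name `read_fasta` is hypothetical and only for illustration; in practice `SeqIO.parse` already does this job and handles more edge cases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration: the header is the text after '>',
    and the sequence is every following line joined together.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Because it accepts any iterable of lines, it can be fed an open file handle or a plain list of strings.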

Solution

  • I'm not familiar with the library, but I think the test you're suggesting can be written as:

    if all(c.upper() in 'ATGC' for c in seq_record.seq):
        pass # it's DNA
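
The strict check above rejects any DNA sequence that contains `N` or another ambiguity code, so a slightly more tolerant variant may be useful. The following is only a sketch under stated assumptions: the names `looks_like_dna` and `ensure_protein` and the 0.9 threshold are hypothetical choices, not part of Biopython. It also assumes each DNA record is a coding sequence in frame 1 that should be translated with the standard genetic code, which is what `Seq.translate()` does by default.

```python
# Assumption: N (unknown base) is counted as a DNA letter.
DNA_LETTERS = set("ACGTN")


def looks_like_dna(sequence, threshold=0.9):
    """Heuristic: True if at least `threshold` of the letters are A, C, G, T or N."""
    letters = [c for c in str(sequence).upper() if c.isalpha()]
    if not letters:
        return False
    dna_count = sum(1 for c in letters if c in DNA_LETTERS)
    return dna_count >= threshold * len(letters)


def ensure_protein(record):
    """Translate record.seq if it looks like DNA, then return the record.

    `record` is expected to behave like a Biopython SeqRecord, i.e. to have
    a `.seq` attribute that provides a `.translate()` method.
    """
    if looks_like_dna(record.seq):
        record.seq = record.seq.translate()
    return record
```

In the loop from the question, calling `ensure_protein(seq_record)` before `SeqIO.write(seq_record, outfile, 'fasta')` would mean each small file is written as protein in the first place, so nothing has to be replaced afterwards. Note that this remains a heuristic: a short peptide made up mostly of alanine, cysteine, glycine and threonine would be misclassified as DNA.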