I have a program that fetches a bunch of sequences from a database and downloads them into a FASTA file. The problem is that these sequences could be protein or they could be DNA. I'm splitting the large FASTA file into many small FASTA files, and once I have the sequences, I need them all to be protein. So I want to test each one to see whether it's protein.
If they're all protein, I'm fine, and if they're all DNA, I have an elegant way to translate them all. What I need is a way to test each new FASTA file and, if it's DNA, translate it and have the translation replace the DNA file.
Here's what I have so far:
    from Bio import Entrez, SeqIO
    from Bio.Seq import Seq

    # output_file is the large FASTA file downloaded earlier
    record_iter = SeqIO.parse(open(output_file), 'fasta')
    for seq_record in record_iter:
        # Write each record to its own small FASTA file
        outfile = '{0}.fa'.format(seq_record.id)
        count = SeqIO.write(seq_record, outfile, 'fasta')
        xmlfile = '{0}.xml'.format(seq_record.id)
        print(xmlfile)  # Added this to show it's working, not stalled.
        if...
        # Here is where I would somehow test each "outfile" to see whether
        # it's DNA or protein, and then do something different with each one.
I've tried converting it to a string (I think), and I can't use alphabets, because that's not how the FASTA is formatted, and I've tried a bunch of other things. Anyway, any help would be appreciated.
Just for those who aren't familiar, a FASTA file has the following format:
>here is a bunch of identification information about the sequence, after the '>' character.
GAAATTTGAGGCGTTCGCTGTGCAGTGAAAAGTGAGACTTTCTACTGTTCGCGTAGAAAGTGCAATAACC
AAGCCACCCACTCAGTGCCCAGACTAGCAACACAAGTCCGGCAAAATGGGAATCAAGTTCCTGGAAGTTA
TCAAACCGTTCTGCAGTATACTGCCGGAAATCGCAAAACCGGAGCGCAAGATCCAATTCAGGGAGAAAGT
GCTATGGACTGCGATCACCCTGTTCATCTTCCTGGTGTGCTGCCAGATCCCGCTTTTCGGTATCATGAGC
TCAGACTCGGCGGATCCCTTCTACTGGATCCGTGTGATCCTGGCCTCCAACCGTGGTACGCTCATGGAGC
TGGGTATCTCGCCCATCGTGACCTCTGGCCTCATTATGCAGCTGCTGGCCGGAGCA
I'm not familiar with the library, but I think the test you're describing could be written as:
    if all(c.upper() in 'ATGC' for c in seq_record.seq):
        pass  # it's DNA
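One caveat with a strict "every letter is A, T, G or C" test is that downloaded nucleotide records often contain N for an unknown base, and a single N would make the record look like protein. A more tolerant variant is sketched below. It is only a heuristic, not anything provided by Biopython: the function name `looks_like_dna`, the letter set and the 0.9 threshold are all assumptions made for illustration, and a very short protein made up only of A, C, G, T and N residues would still be misclassified.

```python
# Letters expected in a nucleotide record: the four bases, U for RNA,
# and N for an unknown base. (Assumption: other IUPAC ambiguity codes
# are rare enough to ignore for this heuristic.)
NUCLEOTIDE_LETTERS = set("ACGTUN")


def looks_like_dna(sequence, threshold=0.9):
    """Return True if at least `threshold` of the letters are nucleotide codes.

    `sequence` is a plain string, e.g. str(seq_record.seq).
    The threshold is a hypothetical default; tune it for your data.
    """
    # Ignore gaps, stop symbols and whitespace; compare letters only.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False  # empty record: nothing to classify
    hits = sum(1 for c in letters if c in NUCLEOTIDE_LETTERS)
    # float() keeps the division correct on Python 2 as well.
    return hits / float(len(letters)) >= threshold
```

In the loop from the question, this could be called as `looks_like_dna(str(seq_record.seq))` before writing the record. If it returns True, replacing the sequence with `seq_record.seq.translate()` and then calling `SeqIO.write` would put the protein, rather than the DNA, into the small FASTA file. That assumes the DNA is in the correct reading frame, which this check cannot tell you.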