I'm trying to create a function in python that reads both unambiguous and ambiguous nucleotide sequences from a fasta file and returns sequence ID and molecular weight.
I tried this with the following code:
import Bio
from Bio import SeqUtils, SeqIO
def function(filename):
nucleotides ={'A','T', 'C', 'G'}
with open(filename) as file:
for record in SeqIO.parse(file, "fasta"):
for nucl in record:
if nucl in nucleotides:
continue
else:
print(str(record.id)+": is ambiguous")
break
else:
mol_weight= Bio.SeqUtils.molecular_weight(record)
print(str(record.id)+": is unambiguous & molecular weight = "+str(mol_weight))
function("random.fasta")
If I use this code on ambiguous sequences, there is absolutely no problem and I get the result that I had in mind. If I however include unambiguous sequences, I get "ValueError: 'I' is not a valid unambiguous letter for DNA" caused by the 'Bio.SeqUtils.molecular_weight(record)' function which in my mind doesn't make sense as the letter 'I' would normally cause the break in the first else statement if I got this correctly.
I also used a custom function to manually calculate the molecular weight (based on fixed values) and in this case there are no errors and the function works just fine for both ambiguous and unambiguous sequences. Someone pointed out however that my manual calculation isn't as accurate as the built-in function.
I'm kind of new to Python, but I hope that someone might know why the ValueError still occurs and how to solve it.
I think the error happens in the expression mol_weight= Bio.SeqUtils.molecular_weight(record)
.
Record is implicitly converted to it's string representation ID: SEQUENCE_1 ...
. So the 'I'
in the ValueError does not correspond to a nucleotide at all but to a rather random string value.
Try this line instead: mol_weight = Bio.SeqUtils.molecular_weight(record.seq)
For me it returned the desired result:
SEQUENCE_1: is unambiguous & molecular weight = 331.2218
for a trivial sequence (just letter A) that would before provoke the error you described.
>SEQUENCE_1
A
Update:
The complete solutions looks like this:
from Bio import SeqIO, SeqUtils
def function(filename):
with open(filename) as file:
for record in SeqIO.parse(file, "fasta"):
try:
mol_weight = SeqUtils.molecular_weight(record.seq)
print(f"{record.id}: is unambiguous & molecular weight = {mol_weight}")
except ValueError as exception:
print(f"{record.id}: {exception}")
if __name__ == '__main__':
function("random.fasta")
I included the execption handling from @pippo1980's answer as it is indeed a bit simpler, faster and arguably more pythonic.