I have a good piece of code to read FASTA files:
import gzip
from itertools import groupby

def is_header(line):
    return line[0] == '>'

def parse_fasta(filename):
    if filename.endswith('.gz'):
        opener = lambda filename: gzip.open(filename, 'rb')
    else:
        opener = lambda filename: open(filename, 'r')
    with opener(filename) as f:
        fasta_iter = (it[1] for it in groupby(f, is_header))
        for name in fasta_iter:
            name = name.__next__()[1:].strip()
            sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
            yield name, sequences.upper()
It worked fine until I updated to Python 3.10.4; now when I try to use it I get this error:
Traceback (most recent call last):
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 21, in parse_fasta
sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
StopIteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 112, in <module>
sys.exit(main())
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 81, in main
plasmid, chromosome = split_sequences_from_fasta_file(filename)
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 111, in split_sequences_from_fasta_file
for header, seq in parse_fasta(filename)
RuntimeError: generator raised StopIteration
I run my code in a conda (conda 4.13.0) environment, and I tried to debug the function but got stuck. I don't want to give up on this function and switch to Biopython. If you have any idea how to fix it, I'd really appreciate it. Thanks, Paulo
Example of a FASTA file:
> seq_name
AGGTAGGGA
The funny thing is that when I run the function in the Python interpreter at the command line everything works fine, but when I call the function imported from my script, that's when I get the errors.
>>> import gzip
>>> from itertools import groupby
>>> def is_header(line):
...     return line[0] == '>'
...
>>> for name, seq in parse_fasta("/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/Genomes/Archaea/Asgard/Chr/GCA_008000775.1_ASM800077v1_genomic.fna.gz"):
...     print(name, seq[:50])
...
CP042905.1 Candidatus Prometheoarchaeum syntrophicum strain MK-D1 chromosome, complete genome TAAATATTATAGCCCGTAATAGCAGAGTCACCAACACTTAAAGGTGCATC
>>> quit()
The exception you're getting is because you're manually calling the __next__ method on various iterators in your code. Eventually you do that on an iterator that doesn't have any values left, and you get a StopIteration exception raised.
In much older versions of Python, that was OK to leave uncaught in a generator function: the StopIteration exception would continue to bubble up just like any other exception. For a generator function, raising StopIteration is an expected part of its behavior (it happens automatically when the function ends, either with a return statement or by reaching the end of its code). PEP 479 changed that, starting in Python 3.5 and becoming the default behavior in Python 3.7: a StopIteration that goes uncaught inside a generator is now re-raised as a RuntimeError.
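Here is a tiny, self-contained sketch (hypothetical data, not your parser) that reproduces the same error, just to make the PEP 479 behavior concrete:

def pairs(lines):
    it = iter(lines)
    for header in it:
        body = it.__next__()   # raises StopIteration if a header has no following line
        yield header, body

# On Python 3.7+ the uncaught StopIteration inside the generator is re-raised as
# "RuntimeError: generator raised StopIteration" -- the same error you are seeing.
list(pairs(['>a', 'AAA', '>b']))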
Now, given the logic of your code, I'm not exactly sure why you're getting empty iterators. If the file is in the format you describe, the __next__ calls should always have a value to get, and the StopIteration that comes when there are no values left will be received by the for loop, which will suppress it (and just end the loop). Perhaps some of your files are not correctly formatted, with a header line by itself and no subsequent sequences?
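You can see how that would play out by feeding groupby a malformed example by hand (the file content here is made up):

from itertools import groupby

lines = ['>ok\n', 'ACGT\n', '>dangling\n']   # file ends with a header and no sequence
print([(key, list(group)) for key, group in groupby(lines, lambda l: l[0] == '>')])
# [(True, ['>ok\n']), (False, ['ACGT\n']), (True, ['>dangling\n'])]
# Three groups instead of four, so the second fasta_iter.__next__() in your loop
# has no sequence group left to return and raises StopIteration.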
In any case, you can better diagnose the issue if you catch the StopIteration and print out some diagnostic information. I'd try:
with opener(filename) as f:
    fasta_iter = (it[1] for it in groupby(f, is_header))
    for name in fasta_iter:
        name = name.__next__()[1:].strip()
        try:
            sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
        except StopIteration as si:
            print(f'no sequence for {name}')
            raise ValueError() from si
        yield name, sequences.upper()
If you find that the missing sequence is a normal thing and happens at the end of every one of your files, then you could suppress the error by putting a return statement in the except block, or maybe by using zip in your for loop (for name_iter, sequences_iter in zip(fasta_iter, fasta_iter):). I'd hesitate to jump to that as the first solution, though, since it will throw away a header if there is an extra one, and silently losing data is generally a bad idea.
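For reference, here is a rough sketch of what that zip-based variant could look like (the name parse_fasta_zip is just for illustration). One wrinkle: groupby's group iterators are emptied as soon as the groupby object advances, so each group has to be pulled into a list before zip asks for the next one; a trailing header with no sequence is then silently dropped, which is exactly the data-loss caveat above.

import gzip
from itertools import groupby

def is_header(line):
    return line[0] == '>'

def parse_fasta_zip(filename):
    opener = gzip.open if filename.endswith('.gz') else open
    with opener(filename, 'rt') as f:            # 'rt' so gzip also yields str lines
        # Materialize each group before zip advances groupby to the next one.
        groups = (list(group) for _, group in groupby(f, is_header))
        for header_lines, seq_lines in zip(groups, groups):
            name = header_lines[0][1:].strip()
            sequences = ''.join(line.strip() for line in seq_lines)
            yield name, sequences.upper()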