Search code examples
pythoniteratorpython-itertools

Function to read fasta files not working after update python


I got a good code to read fasta files:

from itertools import groupby


def is_header(line):
    return line[0] == '>'


def parse_fasta(filename):
    if filename.endswith('.gz'):
        opener = lambda filename: gzip.open(filename, 'rb')
    else:
        opener = lambda filename: open(filename, 'r')
    with opener(filename) as f:
        fasta_iter = (it[1] for it in groupby(f, is_header))
        for name in fasta_iter:
            name = name.__next__()[1:].strip()
            sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
            yield name, sequences.upper()

And it worked fine until I update to Python 3.10.4, then when I try to use it I got this error:

Traceback (most recent call last):
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 21, in parse_fasta
    sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 112, in <module>
    sys.exit(main())
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 81, in main
    plasmid, chromosome = split_sequences_from_fasta_file(filename)
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 111, in split_sequences_from_fasta_file
    for header, seq in parse_fasta(filename)
RuntimeError: generator raised StopIteration

I run my code in conda (conda 4.13.0) environment, and I try to debug the function but I got stucked. I don't want to loose this function to try use Biopython. If you guys have any idea to fixe it I really appreciate. Thanks Paulo

Example of fasta file:

> seq_name
  AGGTAGGGA

The funny stuff is that, when I run the function in python interpret at the command line all worked fine, but when I call the functions from the script using the function imported thats when I got the errors.

>>> import gzip
>>> from itertools import groupby

>>> def is_header(line):
...     return line[0] == '>'
... 
>>> for name, seq in parse_fasta("/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/Genomes/Archaea/Asgard/Chr/GCA_008000775.1_ASM800077v1_genomic.fna.gz"):
...     print(name, seq[:50])
... 
CP042905.1 Candidatus Prometheoarchaeum syntrophicum strain MK-D1 chromosome, complete genome TAAATATTATAGCCCGTAATAGCAGAGTCACCAACACTTAAAGGTGCATC
>>> quit()

Solution

  • The exception you're getting is because you're manually calling the __next__ method on various iterators in your code. Eventually you do that on an iterator that doesn't have any values left, and you'll get a StopIteration exception raised.

    In much older versions of Python, that was OK to leave uncaught in a generator function. The StopIteration exception would continue to bubble up just like any other exception. For a generator function, raising StopIteration is an expected part of its behavior (it happens automatically when the function ends, either with a return statement, or by reaching the end of its code). In Python 3.5, this behavior changed, with PEP 479 making it an error for a StopIteration to go uncaught in a generator.

    Now, given the logic of your code, I'm not exactly sure why you're getting empty iterators. If the file is in the format you describe, the __next__ calls should always have a value to get, and the StopIteration that comes when there are no values left will be received by the for loop, which will suppress it (and just end the loop). Perhaps some of your files are not correctly formatted, with a header line by itself with no subsequent sequences?

    In any case, you can better diagnose the issue if you catch the StopIteration and print out some diagnostic information. I'd try:

    with opener(filename) as f:
        fasta_iter = (it[1] for it in groupby(f, is_header))
        for name in fasta_iter:
            name = name.__next__()[1:].strip()
            try:
                sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
            except StopIteration as si:
                print(f'no sequence for {name}')
                raise ValueError() from si
            yield name, sequences.upper()
    

    If you find that the missing sequence is a normal thing and happens at the end of every one of your files, then you could suppress the error by just putting a return statement in the except block, or maybe by using zip in your for loop (for name_iter, sequences_iter in zip(fasta_iter, fasta_iter):). I'd hesitate to jump to that as the first solution though, since it will throw away a header if there is an extra one, and silently losing data is generally a bad idea.