Search code examples
pythonpython-3.xiteratorgeneratorbiopython

Python 3.x - How to efficiently split an array of objects into smaller batch files?


I'm fairly new to Python and I'm attempting to split a textfile where entries consists of two lines into batches of max. 400 objects.

The data I'm working with are thousands of sequences in FASTA format (plain text with a header, used in bioinformatics) where entries look like this:

>HORVU6Hr1G000325.5

PIPPPASHFHPHHQNPSAATQPLCAAMAPAAKKPPLKSSSSHNSAAGDAA

>HORVU6Hr1G000326.1

MVKFTAEELRGIMDKKNNIRNMSVIAHVD

...

In Biopython, there is a parser SeqIO.parse which allows to access these as an array of objects consisting of IDs and strings, which I need to use in later parts of my code, and since I need to be memory efficient, I'd like to avoid reading/parsing the source file more times than necessary.

In Biopython manual, there's a recommended way to do this via a generator, which I'm using: https://biopython.org/wiki/Split_large_file

However, I'm using Python 3.7 whilst the code there is in Python 2.x, so there are definitely some changes necessary. I've changed the

entry = iterator.next()

into

entry = next(iterator)

but I'm not sure if that's all I need to change.

Here's the code:

def batch_iterator(iterator, batch_size=400):
    """Returns lists of length batch_size."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None

            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

while True:
    bsequence = input("Please enter the full path to your FASTA file(e.g. c:\\folder1\\folder2\\protein.fasta):\n")
    try:
        fastafile = open(bsequence)
        break
    except:
        print("File not found!\n")            


record_iter = SeqIO.parse(fastafile,"fasta")
num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1

print("num=%i" % (num,))
if num > 400:
    print("The specified file contains %i sequences. It's recommended to split the FASTA file into batches of max. 400 sequences.\n" % (num,))
    while True:
        decision = input("Do you wish to create batch files? (Original file will not be overwritten)\n(Y/N):")
        if (decision == 'Y' or 'y'):
            for i, batch in enumerate(batch_iterator(record_iter, 400), 1):
                filename = "group_%i.fasta" % (i + 1)
                with open(filename, "w") as handle:
                    count = SeqIO.write(batch, handle, "fasta")
                print("Wrote %i records to %s" % (count, filename))
            break
        elif (decision == 'N' or 'n'):
            break
        else:
            print('Invalid input\n')

...next part of the code

When I run this, after the Y/N prompt, even if I type Y, the program just skips over to the next part of my code without creating any new file. Debugger shows the following:

Do you wish to create batch files? (Original file will not be overwritten)
(Y/N):Y
Traceback (most recent call last):
  File "\Biopython\mainscript.py", line 32, in batch_iterator
    entry = next(iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1569, in _trace
    return self._trace_and_catch(frame, event, arg)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1611, in _trace_and_catch
    frame.f_back, event, marker_function_args, node

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1656, in _handle_progress_event
    self._save_current_state(frame, event, args, node)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1738, in _save_current_state
    exception_info = self._export_exception_info()

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1371, in _export_exception_info
    "affected_frame_ids": exc[1]._affected_frame_ids_,

AttributeError: 'StopIteration' object has no attribute '_affected_frame_ids_'

Is there some difference between Python 2.x and 3.x that I'm overlooking? Is the problem somewhere else? Is this approach completely wrong? Thanks in advance!


Solution

  • I can't check your whole code since you've ommited part of it, but I can see two wrong things here:

    num = 0
    for line in fastafile:
        if line.startswith(">"):
            num += 1
    

    These lines are exhausting your file object fastafile. Remove these lines entirely (and remember to fix the indentation below, remove the if num > 400: check, etc).

    if (decision == 'Y' or 'y'):
    

    This does not do what you think it does. Change it to if decision in ('Y', 'y'): or if decision.lower() == 'y':. You repeat this pattern below in the line if (decision == 'N' or 'n'):, so change that, too.

    Make the changes and try to run the code again.

    Explanation

    1st issue: in Python, a file object (i.e. what open('filename.txt', 'r') returns) is a generator, which means that it can only be iterated over once. This may seem a bit weird at first, but that's the whole point of using generators. A generator as a file object allows the file to be looped over line by line, without ever having to load the whole file content at once - the generator just keeps track of which line comes next.

    The flipside is that they can't go backwards, so when you write your for line in fastafile block, you exhaust the generator. When you later try to call batch_iterator(record_iter, 400), the generator in record_iter is already exhausted, which is why you'll encounter an error later on - the batch_iterator cannot parse the fasta sequences if there's nothing left there to parse.

    2nd issue: for conditionals with boolean operators such as if (decision == 'Y' or 'y'):, Python will always evaluate both sides individually. So Python actually sees if (bool(decision == 'Y') or bool('y')):.

    Since bool('y') evaluates to True (just like any non-empty string), your expression becomes if (bool(decision == 'Y') or True):, which is obviously always true.

    Use one of the methods I suggested in order to compare a variable to more than one value in a conditional.