
Opening and closing files in a loop


Say I have a list of tens of thousands of entries, and I want to write them to files. If the item in the list meets some criteria, I'd like to close the current file and start a new one.

I'm having a couple of issues, which I think stem from the fact that I want to name each file after the first entry in it. Also, the signal to start a new file is whether an entry's first field differs from the previous entry's. So, for example, imagine I have the list:

l = [('name1', 10), ('name1', 30), ('name2', 5), ('name2', 7), ('name2', 3), ('name3', 10)]

I'd want to end up with 3 files: name1.txt should contain 10 and 30, name2.txt should have 5, 7 and 3, and name3.txt should have 10. The list is already sorted by the first element, so all I need to do is check whether the first element is the same as the previous one and, if not, start a new file.

At first I tried:

name = None
for entry in l:
    if entry[0] != name:
        out_file.close()  # fails on the first pass: out_file does not exist yet
        name = entry[0]
        out_file = open("{}.txt".format(name), "w")  # "w": the default mode is read-only
        out_file.write("{}\n".format(entry[1]))
    else:
        out_file.write("{}\n".format(entry[1]))

out_file.close()

There are a couple of problems with this as far as I can tell. First, the first time through the loop, there's no out_file to close. Second, I can't close the last out_file created, since it's defined inside the loop. The following solves the first problem, but seems clunky:

name = None  # no file has been opened yet
for entry in l:
    if name:
        if entry[0] != name:
            out_file.close()
            name = entry[0]
            out_file = open("{}.txt".format(name), "w")  # "w": the default mode is read-only
            out_file.write("{}\n".format(entry[1]))
        else:
            out_file.write("{}\n".format(entry[1]))
    else:
        name = entry[0]
        out_file = open("{}.txt".format(name), "w")
        out_file.write("{}\n".format(entry[1]))

out_file.close()

Is there a better way to do this?

And also, this doesn't seem like it should solve the problem of closing the last file, though this code runs fine - am I misunderstanding the scope of out_file? I thought it would be restricted to inside the for loop.

EDIT: I should probably have mentioned, my data is far more complex than indicated here... it's not actually in a list, it's a stream of SeqRecord objects from Biopython

EDIT 2: OK, I thought I was simplifying in order to avoid distraction. Apparently it had the opposite effect - mea culpa. The following is the equivalent of the second code block above:

from re import sub
from Bio import SeqIO

def gbk_to_faa(some_genbank):
    source = None  # raw 'source' annotation of the group currently being written
    for record in SeqIO.parse(some_genbank, 'gb'):
        if source:
            if record.annotations['source'] != source:
                out_file.close()
                source = record.annotations['source']
                # Compare on the raw annotation; sanitize only for the file name
                file_name = sub(r'\W+', "_", sub(r'\W$', "", source))
                out_file = open("{}.faa".format(file_name), "a+")
                write_all_record(out_file, record)
            else:
                write_all_record(out_file, record)
        else:
            source = record.annotations['source']
            file_name = sub(r'\W+', "_", sub(r'\W$', "", source))
            out_file = open("{}.faa".format(file_name), "a+")
            write_all_record(out_file, record)

    out_file.close()


def write_all_record(file_handle, gbk_record):
    # Does more stuff, I don't think this is important
    # If it is, it's in this gist: https://gist.github.com/kescobo/49ab9f4b08d8a2691a40
    pass  # body omitted here

Solution

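  • It is easier to use the tools Python provides. itertools.groupby collects consecutive items that share a key, which fits here because the list is already sorted by name: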

    from itertools import groupby
    from operator import itemgetter
    
    items = [
        ('name1', 10), ('name1', 30),
        ('name2', 5), ('name2', 7), ('name2', 3),
        ('name3', 10)
    ]
    
    for name, rows in groupby(items, itemgetter(0)):
        with open(name + ".txt", "w") as outf:
            outf.write("\n".join(str(row[1]) for row in rows))
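
On the scoping question: a `for` loop in Python does not create its own scope, so a name assigned inside the loop body (such as `out_file`) is still bound after the loop ends. That is why the final `out_file.close()` in the question works. A minimal sketch, using an in-memory `io.StringIO` as a stand-in for a real file so that nothing is written to disk (`entries` and `handle` are illustrative names):

```python
import io

entries = [('name1', 10), ('name1', 30), ('name2', 5)]

for name, value in entries:
    # Assigned inside the loop body, but bound in the enclosing scope.
    handle = io.StringIO()
    handle.write("{}\n".format(value))

# The loop variables and `handle` still refer to the last iteration's objects.
last_name = name
last_text = handle.getvalue()
handle.close()
print(last_name, repr(last_text), handle.closed)
```

The catch is the empty case: if the sequence has no items, `handle` is never bound and the trailing `close()` raises a `NameError`. The `with` block above avoids both issues, because each file is closed as soon as its group has been written.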
    

    Edit: to match the updated question, here is the updated solution ;-)

    for name, records in groupby(SeqIO.parse(some_genbank, 'gb'), lambda record:record.annotations['source']):
        with open(name + ".faa", "w+") as outf:
            for record in records:
                write_all_record(outf, record)
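
The question's code also cleans the `source` annotation before using it as a file name, which the loop above skips. One way to keep that step is to group on the raw annotation and sanitize only when building the file name. This is a sketch under assumptions: `safe_name` and `group_by_source` are hypothetical helpers, the regular expressions are copied from the question, and `SimpleNamespace` objects stand in for Biopython records so that the example runs without a GenBank file.

```python
from itertools import groupby
from re import sub
from types import SimpleNamespace


def safe_name(source):
    """Hypothetical helper: the same two substitutions as in the question."""
    # Drop one trailing non-word character, then collapse other runs into "_".
    return sub(r'\W+', "_", sub(r'\W$', "", source))


def group_by_source(records):
    """Hypothetical helper: yield (file_name, records) for each consecutive run."""
    for source, group in groupby(records, lambda r: r.annotations['source']):
        yield safe_name(source) + ".faa", list(group)


# Stand-ins for SeqRecord objects; only the 'source' annotation is modelled.
fake_records = [
    SimpleNamespace(id="a", annotations={'source': "Escherichia coli K-12."}),
    SimpleNamespace(id="b", annotations={'source': "Escherichia coli K-12."}),
    SimpleNamespace(id="c", annotations={'source': "Bacillus subtilis"}),
]

grouped = [(fname, [r.id for r in recs])
           for fname, recs in group_by_source(fake_records)]
print(grouped)
```

With real data, the loop body would open each `fname` in a `with open(fname, "w") as outf:` block and call `write_all_record(outf, record)` for every record in the group. Because `groupby` only merges consecutive records, a source that reappears later in the input would be opened a second time, and mode "w" would overwrite the earlier output; the question's "a+" mode appends instead.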