Say I have a list of tens of thousands of entries, and I want to write them to files. If an item in the list meets some criterion, I'd like to close the current file and start a new one.
I'm having a couple of issues, which I think stem from the fact that I want each file to be named after the first entry in that file. Also, the signal to start a new file is whether an entry's field differs from the previous entry's. So, for example, imagine I have the list:
l = [('name1', 10), ('name1', 30), ('name2', 5), ('name2', 7), ('name2', 3), ('name3', 10)]
I'd want to end up with 3 files: name1.txt should contain 10 and 30, name2.txt should have 5, 7 and 3, and name3.txt should have 10. The list is already sorted by the first element, so all I need to do is check whether the first element is the same as the previous one and, if not, start a new file.
At first I tried:
name = None
for entry in l:
    if entry[0] != name:
        out_file.close()
        name = entry[0]
        out_file = open("{}.txt".format(name), "w")
        out_file.write("{}\n".format(entry[1]))
    else:
        out_file.write("{}\n".format(entry[1]))
out_file.close()
There are a couple of problems with this as far as I can tell. First, the first time through the loop, there's no out_file to close. Second, I can't close the last out_file created, since it's defined inside the loop. The following solves the first problem, but seems clunky:
name = None
for entry in l:
    if name:
        if entry[0] != name:
            out_file.close()
            name = entry[0]
            out_file = open("{}.txt".format(name), "w")
            out_file.write("{}\n".format(entry[1]))
        else:
            out_file.write("{}\n".format(entry[1]))
    else:
        name = entry[0]
        out_file = open("{}.txt".format(name), "w")
        out_file.write("{}\n".format(entry[1]))
out_file.close()
Is there a better way to do this?
Also, this doesn't seem like it should solve the problem of closing the last file, though this code runs fine. Am I misunderstanding the scope of out_file? I thought it would be restricted to inside the for loop.
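On the scope point: in Python a for loop does not create a new scope, so a name bound inside the loop body is still visible after the loop ends. That would explain why the final out_file.close() works. A minimal sketch of this behaviour follows; the function and variable names are made up for illustration and are not part of the code above.

```python
def last_name_seen(entries):
    """Return the value bound inside the loop, read after the loop has finished."""
    for entry in entries:
        current = entry[0]  # bound inside the loop body
    # The for loop created no new scope, so `current` is still visible here.
    # If `entries` were empty, `current` would never be bound and this line
    # would raise UnboundLocalError, much like closing a file that was never opened.
    return current

print(last_name_seen([("name1", 10), ("name2", 5)]))  # prints name2
```

The same applies at module level: only functions, classes, and comprehensions introduce new scopes, not loops or if blocks.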
EDIT: I should probably have mentioned that my data is far more complex than indicated here. It's not actually in a list; it's a SeqRecord from BioPython.
EDIT 2: OK, I thought I was simplifying in order to avoid distraction. Apparently it had the opposite effect - mea culpa. The following is the equivalent of the second code block above:
from re import sub
from Bio import SeqIO

def gbk_to_faa(some_genbank):
    source = None
    for record in SeqIO.parse(some_genbank, 'gb'):
        if source:
            if record.annotations['source'] != source:
                out_file.close()
                source = sub(r'\W+', "_", sub(r'\W$', "", record.annotations['source']))
                out_file = open("{}.faa".format(source), "a+")
                write_all_record(out_file, record)
            else:
                write_all_record(out_file, record)
        else:
            source = sub(r'\W+', "_", sub(r'\W$', "", record.annotations['source']))
            out_file = open("{}.faa".format(source), "a+")
            write_all_record(out_file, record)
    out_file.close()

def write_all_record(file_handle, gbk_record):
    # Does more stuff, I don't think this is important
    # If it is, it's in this gist: https://gist.github.com/kescobo/49ab9f4b08d8a2691a40
It is easier to use the tools Python provides:
from itertools import groupby
from operator import itemgetter

items = [
    ('name1', 10), ('name1', 30),
    ('name2', 5), ('name2', 7), ('name2', 3),
    ('name3', 10)
]

for name, rows in groupby(items, itemgetter(0)):
    with open(name + ".txt", "w") as outf:
        outf.write("\n".join(str(row[1]) for row in rows))
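To check what groupby actually produces, here is a small sketch that writes into a temporary directory and reads one file back. Note that groupby only merges consecutive items with equal keys, so it relies on the list already being sorted by name, as the question states. The write_groups helper and the use of tempfile are my own additions for illustration, not part of the original code.

```python
import os
import tempfile
from itertools import groupby
from operator import itemgetter

def write_groups(items, out_dir):
    """Write one <name>.txt per run of consecutive items sharing the same name.

    Hypothetical helper for illustration; returns the file names in the order written.
    """
    written = []
    for name, rows in groupby(items, key=itemgetter(0)):
        path = os.path.join(out_dir, name + ".txt")
        # The with-block closes each file as soon as its group is finished.
        with open(path, "w") as outf:
            for _, value in rows:
                outf.write("{}\n".format(value))
        written.append(name + ".txt")
    return written

items = [('name1', 10), ('name1', 30), ('name2', 5),
         ('name2', 7), ('name2', 3), ('name3', 10)]

with tempfile.TemporaryDirectory() as tmp:
    files = write_groups(items, tmp)
    with open(os.path.join(tmp, "name2.txt")) as f:
        name2_lines = f.read().split()

print(files)        # ['name1.txt', 'name2.txt', 'name3.txt']
print(name2_lines)  # ['5', '7', '3']
```

Because each file is opened in a with statement, there is no need to track the previous name or to close anything by hand, including the last file.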
Edit: to match the updated question, here is the updated solution ;-)
for name, records in groupby(SeqIO.parse(some_genbank, 'gb'),
                             lambda record: record.annotations['source']):
    with open(name + ".faa", "w+") as outf:
        for record in records:
            write_all_record(outf, record)
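One detail the loop above leaves out is the filename clean-up from the question: the raw source annotation may contain spaces or punctuation. A possible sketch is to group on the raw annotation and sanitize only when building the file name. Here safe_name and group_sources are hypothetical helpers, and plain dicts with made-up values stand in for the SeqRecord annotations so the snippet runs without Biopython.

```python
from itertools import groupby
from re import sub

def safe_name(source):
    """Apply the question's two substitutions: drop one trailing non-word
    character, then replace each run of non-word characters with '_'."""
    return sub(r'\W+', "_", sub(r'\W$', "", source))

def group_sources(annotations):
    """Yield (file_name, number_of_records) for each run of equal 'source' values."""
    for source, group in groupby(annotations, key=lambda a: a['source']):
        yield safe_name(source) + ".faa", len(list(group))

# Stand-in data, invented for illustration; real code would iterate
# over the records from SeqIO.parse(some_genbank, 'gb') instead.
annotations = [
    {'source': 'Escherichia coli K-12.'},
    {'source': 'Escherichia coli K-12.'},
    {'source': 'Bacillus subtilis'},
]

result = list(group_sources(annotations))
print(result)  # [('Escherichia_coli_K_12.faa', 2), ('Bacillus_subtilis.faa', 1)]
```

If two different sources could sanitize to the same name, opening with "w+" would overwrite the earlier file, whereas the "a+" mode used in the question would append to it; which is preferable depends on the data.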