My current multi-line FASTA file looks like this:
>chr1|chromosome:Mt4.0v2:1:1:52991155:1
ATGC...
>chr2|chromosome:Mt4.0v2:2:1:45729672:1
ATGC...
...and so on.
I need to parse the fasta file into separate files containing only the record.description in the header (everything after the |) followed by the sequence. However, I need to use the record.ids as the filenames (chr1.fasta, chr2.fasta, etc.). Is there any way to do this?
My current attempt at solving this is below. It does produce only the description in the header, but with the last sequence's record.id as the filename. I need separate files.
from Bio import SeqIO
def yield_records(in_file):
    for record in SeqIO.parse(in_file, 'fasta'):
        record.description = record.id = record.id.split('|')[1]
        yield record
SeqIO.write(yield_records('/correctedfasta.fasta'), record.id+'.fasta', 'fasta')
Your code has almost everything that is needed. yield can also hand back more than one value at once (as a tuple), i.e. you could yield both the filename and the record itself, e.g.
yield record.id.split('|')[0], record
but then Biopython would still bite you because the id gets written to the FASTA header. You would therefore need to modify the id and also overwrite the description (otherwise it gets appended to the id), or just assign identical values to both, as you did.
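A simple solution would be:
from Bio import SeqIO

def split_record(record):
    # The part before the first '|' becomes the filename stem.
    old_id = record.id.split('|')[0]
    # Everything after the first '|' becomes the new FASTA header.
    record.id = '|'.join(record.id.split('|')[1:])
    # Clear the description so it is not appended to the id.
    record.description = ''
    return old_id, record

filename = 'multiline.fa'
for record in SeqIO.parse(filename, 'fasta'):
    old_id, record = split_record(record)
    # One output file per record, e.g. chr1.fa (use '.fasta' for chr1.fasta).
    with open(old_id + '.fa', 'w') as f:
        SeqIO.write(record, f, 'fasta')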
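If Biopython is not available, the same split can be sketched with the standard library alone. This is a different technique from the SeqIO approach above, not a drop-in replacement: the function name split_fasta and the out_dir parameter are made up for illustration, and it assumes every header contains a '|' with no whitespace before it. Unlike SeqIO.write, it copies the sequence lines through unchanged instead of re-wrapping them.

```python
import os


def split_fasta(in_path, out_dir="."):
    """Write one FASTA file per record, named after the text before '|'.

    Hypothetical helper: assumes headers look like '>chr1|description'.
    Returns the list of paths that were written.
    """
    out = None
    written = []
    try:
        with open(in_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # Header line: close the previous file, start a new one.
                    if out is not None:
                        out.close()
                    name, _, rest = line[1:].rstrip("\n").partition("|")
                    path = os.path.join(out_dir, name + ".fasta")
                    out = open(path, "w")
                    written.append(path)
                    # Keep only the part after the first '|' in the header.
                    out.write(">" + rest + "\n")
                elif out is not None:
                    # Sequence line: copy it through unchanged.
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling split_fasta('multiline.fa') on the file from the question should then write chr1.fasta, chr2.fasta and so on into the current directory.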