python-3.x parsing split biopython fasta

Parse multiline fasta file using record.id for filenames but not in headers

My current multiline fasta file is as such:

>chr1|chromosome:Mt4.0v2:1:1:52991155:1
ATGC...

>chr2|chromosome:Mt4.0v2:2:1:45729672:1
ATGC...

...and so on.

I need to parse the fasta file into separate files containing only the record.description in the header (everything after the |) followed by the sequence. However, I need to use the record.ids as the filenames (chr1.fasta, chr2.fasta, etc.). Is there any way to do this?

My current attempt at solving this is below. It does produce only the description in the header, but everything ends up in one file with the last sequence's record.id as the filename. I need separate files.

from Bio import SeqIO

def yield_records(in_file):
    for record in SeqIO.parse(in_file, 'fasta'):
        record.description = record.id = record.id.split('|')[1]
        yield record

SeqIO.write(yield_records('/correctedfasta.fasta'), record.id+'.fasta', 'fasta')

Solution

Your code has almost everything that is needed. yield can also produce more than one value at a time (as a tuple), i.e. you could yield both the filename and the record itself, e.g.

yield record.id.split('|')[0], record

but then Biopython would still bite you because the id gets written to the FASTA header. You would therefore need to both modify the id and overwrite the description (it gets concatenated to the id otherwise), or just assign identical values to both as you did.
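As a rough sketch of that idea, the generator below yields (filename stem, record) pairs and rewrites the header fields on the way. The name yield_named_records is made up for this illustration, and the SimpleNamespace objects are only stand-ins for Biopython records so the snippet runs without an input file. It assumes every id contains a '|'.

```python
from types import SimpleNamespace


def yield_named_records(records):
    """Yield (filename_stem, record) pairs (hypothetical helper name).

    Assumes each record has .id and .description attributes and that
    every id contains at least one '|'.
    """
    for record in records:
        # 'chr1|chromosome:...' -> stem 'chr1', rest 'chromosome:...'
        stem, rest = record.id.split('|', 1)
        record.id = rest
        # Clear the description so it is not appended to the id in the header
        record.description = ''
        yield stem, record


# Stand-in objects instead of real SeqRecords (assumption for the demo)
demo = [
    SimpleNamespace(id='chr1|chromosome:Mt4.0v2:1:1:52991155:1',
                    description='chr1|chromosome:Mt4.0v2:1:1:52991155:1'),
    SimpleNamespace(id='chr2|chromosome:Mt4.0v2:2:1:45729672:1',
                    description='chr2|chromosome:Mt4.0v2:2:1:45729672:1'),
]

pairs = list(yield_named_records(demo))
for stem, record in pairs:
    print(stem + '.fasta', record.id)
```

With real data you would pass SeqIO.parse(filename, 'fasta') instead of demo and call SeqIO.write(record, stem + '.fasta', 'fasta') inside the loop.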

A simple solution would be

from Bio import SeqIO

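def split_record(record):
    # The part before the first '|' (e.g. 'chr1') becomes the filename
    old_id = record.id.split('|')[0]
    # Everything after the first '|' becomes the new header
    record.id = '|'.join(record.id.split('|')[1:])
    # Clear the description, otherwise it is appended to the id in the header
    record.description = ''
    return old_id, record

filename = 'multiline.fa'

for record in SeqIO.parse(filename, 'fasta'):
    old_id, record = split_record(record)
    # One output file per record, named after the original id (chr1.fasta, ...)
    with open(old_id + '.fasta', 'w') as f:
        SeqIO.write(record, f, 'fasta')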
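One caveat: the split above assumes every id contains a '|'. For an id without one, '|'.join(record.id.split('|')[1:]) is an empty string, so the new id would be empty. If your file might contain such records, a slightly more defensive split could look like the sketch below. The helper name split_header_id and the fallback of keeping the full id as the header are my own choices here, not something Biopython prescribes.

```python
def split_header_id(record_id, sep='|'):
    """Return (filename_stem, new_header_id) for a FASTA record id.

    Hypothetical helper: falls back to the unchanged id as the header
    when there is no separator or nothing follows it.
    """
    stem, found, rest = record_id.partition(sep)
    if not found or not rest:
        # No usable text after the separator: keep the original id
        return stem, record_id
    return stem, rest


print(split_header_id('chr1|chromosome:Mt4.0v2:1:1:52991155:1'))
print(split_header_id('scaffold7'))
```

You could then assign the second value to record.id and use the first for the filename, exactly as in the loop above.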