Tags: python-3.x, parsing, split, biopython, fasta

Parse multiline fasta file using record.id for filenames but not in headers


My multiline FASTA file currently looks like this:

>chr1|chromosome:Mt4.0v2:1:1:52991155:1
ATGC...

>chr2|chromosome:Mt4.0v2:2:1:45729672:1
ATGC...

...and so on.

I need to split the FASTA file into separate files, one per record, each with a header that contains only the record.description part (everything after the |), followed by the sequence. The part before the | should be used as the filename (chr1.fasta, chr2.fasta, etc.). Is there any way to do this?

My current attempt is below. It does write only the description to the header, but it produces a single file named after the record.id of the last sequence. I need separate files.

from Bio import SeqIO

def yield_records(in_file):
    for record in SeqIO.parse(in_file, 'fasta'):
        record.description = record.id = record.id.split('|')[1]
        yield record

SeqIO.write(yield_records('/correctedfasta.fasta'), record.id+'.fasta', 'fasta')

Solution

  • Your code already has almost everything it needs. A generator can also yield more than one value at a time (as a tuple), so you could yield both the filename and the record itself, e.g.

    yield record.id.split('|')[0], record
    

    but Biopython would still trip you up, because the id is written to the FASTA header. You therefore need to change the id and also overwrite the description (otherwise it is appended to the id in the header), or assign the same value to both, as you did.

    A simple solution would be:

    from Bio import SeqIO
    
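    def split_record(record):
        # 'chr1|chromosome:...' -> old_id 'chr1', new id 'chromosome:...'
        old_id, _, new_id = record.id.partition('|')
        record.id = new_id
        record.description = ''  # otherwise it is appended to the id in the header
        return old_id, record
    
    filename = 'multiline.fa'
    
    for record in SeqIO.parse(filename, 'fasta'):
        old_id, record = split_record(record)
        # One output file per record, named after the part before the '|'
        with open(old_id + '.fasta', 'w') as f:
            SeqIO.write(record, f, 'fasta')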
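
The generator idea from the start of this answer (yielding the filename together with the record) could be sketched roughly as follows. The names `split_records` and `write_split`, the in-memory demo input, and the temporary output directory are illustrative assumptions, not part of the original question. The sketch also assumes that the headers contain no spaces, as in the example file; anything after the first space would be dropped when the description is cleared.

```python
import tempfile
from io import StringIO
from pathlib import Path

from Bio import SeqIO


def split_records(handle):
    """Yield (file_stem, record) pairs from a FASTA handle or path (hypothetical helper)."""
    for record in SeqIO.parse(handle, 'fasta'):
        # 'chr1|chromosome:...' -> stem 'chr1', header 'chromosome:...'
        stem, _, header = record.id.partition('|')
        record.id = header
        record.description = ''  # otherwise it is appended to the id in the header
        yield stem, record


def write_split(handle, out_dir):
    """Write each record to <out_dir>/<stem>.fasta and return the paths (hypothetical helper)."""
    written = []
    for stem, record in split_records(handle):
        path = Path(out_dir) / f"{stem}.fasta"
        SeqIO.write(record, str(path), 'fasta')
        written.append(path)
    return written


# Tiny in-memory input (shortened sequences) so the sketch runs without a real file
demo = StringIO(
    ">chr1|chromosome:Mt4.0v2:1:1:52991155:1\nATGC\n"
    ">chr2|chromosome:Mt4.0v2:2:1:45729672:1\nATGC\n"
)

out_dir = tempfile.mkdtemp()  # assumption: write into a throwaway directory
paths = write_split(demo, out_dir)
```

For a real file, the path to the input FASTA can be passed instead of the `StringIO` object, and `out_dir` can be any existing directory. Keeping the parsing in a generator means only one record is held in memory at a time, which matters for chromosome-sized sequences.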