
Manipulate a BLAST result file in Python


I wrote a Biopython script that gives me results, and I have a file like this:

>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET

>NW_020169394.1_41 [10497-10619]|KE646921.1_20 [383-240]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET

>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET

How can I merge these records so that all hits end up on a single header line, like this:

>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET

I tried using a regex, but it didn't work. Thanks for your answers.


Solution

  • In a FASTA label, everything up to the first whitespace is the record ID, which is supposed to be unique. That's not the case in your example, so SeqIO.to_dict() won't work (it raises a ValueError on duplicate keys). Instead, map each sequence back to its labels and then combine them:

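    from Bio import SeqIO
    from collections import defaultdict

    # Map each distinct sequence to the descriptions of all records that share it.
    seq2label = defaultdict(list)
    for record in SeqIO.parse('result.fa', 'fasta'):
        # record.description is the full header line without the leading '>'
        seq2label[str(record.seq)].append(record.description)

    for sequence, labels in seq2label.items():
        # Keep the first label whole; from the others take only the hit after '|'.
        combined_label = ' | '.join(labels[:1] + [label.split('|')[1] for label in labels[1:]])
        print(f'>{combined_label}\n{sequence}\n')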
    

    output:

    >NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
    MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
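
The approach above groups records by sequence alone, so two different queries that happen to share a sequence would be merged under the first query's label. If that matters for your data, one possible variation is to group by the query part of the header together with the sequence. The following is only a sketch using the standard library: the helper names parse_fasta and merge_hits are made up for illustration, and it assumes every header has the form "query|hit" as in the question.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper: a minimal reader, not a replacement for SeqIO.parse.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def merge_hits(records):
    """Merge records that share both the query label and the sequence.

    Assumes headers look like 'query|hit' (the layout shown in the question).
    Returns a list of (combined_label, sequence) in first-seen order.
    """
    groups = {}
    for header, sequence in records:
        query, _, hit = header.partition('|')
        hits = groups.setdefault((query.strip(), sequence), [])
        if hit.strip():
            hits.append(hit.strip())

    merged = []
    for (query, sequence), hits in groups.items():
        label = f"{query}|{' | '.join(hits)}" if hits else query
        merged.append((label, sequence))
    return merged


def write_fasta(merged, handle):
    """Write (label, sequence) pairs to an open text file in FASTA layout."""
    for label, sequence in merged:
        handle.write(f'>{label}\n{sequence}\n\n')
```

With the file from the question, something like `merge_hits(parse_fasta(open('result.fa')))` should give one entry per query/sequence pair, which `write_fasta` can then save to a new file.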