I wrote a Biopython script that gives me results in a file like this:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE646921.1_20 [383-240]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
How can I merge these into a single header line, like this:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
I tried using a regex, but it doesn't work. Thanks for your answers.
In a FASTA label, everything up to the first whitespace is the ID, and it is supposed to be unique. That's not the case in your example, so SeqIO.to_dict() won't work: it raises a ValueError on duplicate keys. Instead, map each sequence back to its labels and then combine them:
from collections import defaultdict

from Bio import SeqIO

# Group the full description lines by their (identical) sequence.
seq2label = defaultdict(list)
for record in SeqIO.parse('result.fa', 'fasta'):
    seq2label[str(record.seq)].append(record.description)

for sequence, labels in seq2label.items():
    # Keep the first label whole; from the others take only the part after '|'.
    combined_label = ' | '.join(labels[:1] + [label.split('|')[1] for label in labels[1:]])
    print(f'>{combined_label}\n{sequence}')
output:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
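One caveat with grouping on the sequence alone: two records with the same sequence but a different first label (the part before the '|') would be merged under whichever label came first. If that can happen in your data, a minimal sketch of a stricter variant is below. It groups on both the first label and the sequence. The function name merge_labels and the (description, sequence) pair input are my own choices for illustration, not part of Biopython.

```python
from collections import defaultdict


def merge_labels(records):
    """Merge records sharing both the first label part and the sequence.

    `records` is an iterable of (description, sequence) string pairs.
    Returns a list of (combined_label, sequence) pairs in first-seen order.
    """
    groups = defaultdict(list)
    for description, sequence in records:
        # Split once on '|': left is the shared query label, right is the hit.
        query, _, hit = description.partition('|')
        groups[(query.strip(), sequence)].append(hit.strip())

    merged = []
    for (query, sequence), hits in groups.items():
        hits = [hit for hit in hits if hit]  # ignore labels that had no '|'
        label = f"{query}|{' | '.join(hits)}" if hits else query
        merged.append((label, sequence))
    return merged
```

With Biopython you could feed it ((r.description, str(r.seq)) for r in SeqIO.parse('result.fa', 'fasta')) and then print each pair as '>' + label followed by the sequence.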