Search code examples
pythonfasta

How to get the count of duplicated sequences in fasta file using python


I have a fasta file like this: test_fasta.fasta

>XXKHH_1
AAAAATTTCTGGGCCCC
>YYYXXKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>TTDTT_11
TTTGGGAATTAAACCCT
>ID_2SS
TTTGGGAATTAAACCCT
>YKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>YKHSH_1S
TTAAAAATTTCTGGGCCCCGGGAAAAAA

I want to get the count of duplicated sequences and append the total counts for each sequence in the file (sorted from the most to least) and get the result as shown below:

>YYYXXKHH_1_counts3
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>TTDTT_11_counts2
TTTGGGAATTAAACCCT
>XXKHH_1_counts1
AAAAATTTCTGGGCCCC

I have this code which finds the duplicated sequences and joins theirs ids together, but instead of joining them together, I just want the counts for duplicates appended in the id as shown above in results.

from Bio import SeqIO
from collections import defaultdict

dedup_records = defaultdict(list)
for record in SeqIO.parse("test_fasta.fasta", "fasta"):
    # Use the sequence as the key and then have a list of id's as the value
    dedup_records[str(record.seq)].append(record.id)
with open("Output.fasta", 'w') as output:
    for seq, ids in dedup_records.items():
        # Join the ids and write them out as the fasta
        output.write(">{}\n".format('|'.join(ids)))
        output.write(seq + "\n")

Solution

  • Since you already have the IDs of each duplicating record in the ids list in your output loop, you can simply output the first ID (which you apparently want, per your expected output), followed by the length of the ids list:

    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        output.write(">{}_counts{}\n".format(ids[0], len(ids)))
        output.write(seq + "\n")