Search code examples
pythonbioinformaticsbiopython

unique clones finding using SeqIO module of biopython


I am working on Next Generation Sequencing (NGS) analysis of DNA. I am using SeqIO Biopython module to parse the DNA libraries in Fasta format. I want to filter the unique clones (unique records) only. I am using the following python code for this purpose.

seen=[]
unique_clones=[]
records=list(SeqIO.parse('DNA_library', 'fasta'))
for record in records:
  if str(record.seq) not in seen:
    seen.append(str(record.seq))
    unique_clones.append(record)
SeqIO.write(unique_clones, 'unique_clones.fasta', 'fasta')

one DNA library has around 1 million records, and I have more than 100 libraries to be analyzed. 1 Library is taking more than 2 Hours to be filtered by this code. This code seems very slow to extract unique clones. Is there any other way to filter unique records??


for python coders who don't have any experience in working with bioinformatics

clones in fasta format has two parameters, ID (>id-number) and record(ATCG) and looks like below:

>id-No-1
ATCGGGCTAAATTCGACTGCAGT

>id-No-2
ATCGGGCTAAATTCGACTGCAGT

I just want to filter unique clones based on their record and want to print unique clones (id and record).

Please let me know if I am not enough clear in my question or explanation.


Solution

  • I don't have your files so I cannot test the actual performance gain you'll get, but here are some things that stick out as slow to me:

    • the line records=list(SeqIO.parse('DNA_library', 'fasta')) converts the records into a list of records, which may sound inoffensive but becomes costly if you have millions of records. According to the docs, SeqIO.parse(...) returns an iterator so you can simply iterate over it directly.
    • Use a set instead of a list when keeping track of seen records. When performing membership checking using in, lists must iterate through every element while sets perform the operation in constant time (more info here).

    With those changes, your code becomes:

    seen_records = set()
    records_to_keep = []
    
    for record in SeqIO.parse('DNA_library', 'fasta'):
      seq = str(record.seq)
      if seq not in seen_records:
        seen_records.add(seq)
        records_to_keep.append(record)
    
    SeqIO.write(records_to_keep, 'unique_clones.fasta', 'fasta')