Search code examples
pythonduplicatesbiopythonfasta

Remove duplicated sequences in FASTA with Python


I apologize if the question has been asked before, but I have been searching for days and could not find a solution in Python.

I have a large fasta file, containing headers and sequences.

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

>cavPor3_rmsk_tRNA-Ser-TCA(m) range=chrM:6875-6940 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

This is a very small fragment of what the file looks like. I want to keep only the first entry (header and sequence) if, as you can see for the last two entries, the sequences are the same.

The output would look like this:

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

The problem is that the FASTA file is over one gigabyte in size. I have found ways of solving this by removing duplicates based on duplicate IDs or by using bash, but sadly I can't do this on my computer. This task is for a research project, not a homework or task.

Thank you in advance for your help!


Solution

  • this copied from here: Remove Redundant Sequences from FASTA file in Python

    uses Biopython, but works with fasta file where headers are of the:

    '> header' type see FAsta Format Wiki

    from Bio import SeqIO
    import time
    
    start = time.time() 
    
    seen = []
    records = []
    
    for record in SeqIO.parse("INPUT-FILE", "fasta"):  
        if str(record.seq) not in seen:
            seen.append(str(record.seq))
            records.append(record)
    
    
    #writing to a fasta file
    SeqIO.write(records, "OUTPUT-FILE", "fasta")
    end = time.time()
    
    print(f"Run time is {(end- start)/60}") 
    
    
    

    faster as suggested by MattMDo using a set istead of a list:

    seen = set()
    records = []
    
    for record in SeqIO.parse("b4r2.fasta", "fasta"):  
        if record.seq not in seen:
            seen.add(record.seq)
            records.append(record)
    

    I've got a longer one that uses argparser but its slower because counts the sequences can post it if needed