Tags: python-3.x, trim, biopython, fasta

Trim FASTA files using Biopython


I have a fasta file with multiple sequences in it. Some of the sequences end in a run of '-' characters, and I'd like to trim those from the final sequences. Is there a clean way to strip them and write a new fasta file without the trailing dashes using Biopython?

I saw the post "How to remove all-N sequence entries from fasta file(s)" and tried to adapt some of the code, but it didn't work.

The file contains sequences like this:

sequence_of_interest CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA---------------------------------------------------------------

from Bio import SeqIO

def dash_removal(file_in, file_out):
    records = SeqIO.parse(file_in, 'fasta')
    # Keeps every record that has at least one non-'-' character; nothing is trimmed here
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, file_out, 'fasta')

dash_removal("dash_removal_test.fasta", "dashes_gone?.fasta")

All of the sequences should ultimately be trimmed to look like this:

sequence_of_interest CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA

Any help would be appreciated!


Solution

  • All the options using sed are great because they are faster, but here is a way to do it in Biopython.

    The idea is to call rstrip('-') on the seq attribute of each record. A Biopython Seq object supports rstrip just like any other string in Python, so only the dashes at the end of the sequence are removed.
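To see what rstrip does and does not remove, here is a tiny plain-string illustration. It is only a sketch with a made-up sequence, and it uses an ordinary str rather than a Seq object, so it runs without Biopython.

```python
# Made-up aligned sequence with leading, internal and trailing gap characters
aligned = "--ACG-T---"

# rstrip('-') removes only the run of dashes at the right-hand end
trimmed = aligned.rstrip("-")

print(trimmed)  # --ACG-T
```

Leading and internal dashes are left alone, which is usually what you want when the dashes inside the sequence are real alignment gaps.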

    from Bio import SeqIO
    import io
    
    seq = """>sequence_of_interest
    CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCAT
    GTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAA
    TGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCA
    CCAGGCCAGATGAGAGAA--------------------------------------------------------------"""
    
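    f = io.StringIO(seq)  # in-memory stand-in; for a real file use f = open('my_fasta.fa', 'r')
    clean_records = []
    for record in SeqIO.parse(f, "fasta"):
        # Seq.rstrip works like str.rstrip: only the trailing '-' characters are removed
        record.seq = record.seq.rstrip('-')
        clean_records.append(record)
    
    # Write the trimmed records to a new fasta file
    with open('clean_fasta.fa', 'w') as out:
        SeqIO.write(clean_records, out, 'fasta')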
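If the input file is large, the same idea can be written as a file-to-file function that streams records instead of collecting them in a list. This is a minimal sketch rather than part of the answer above: the function name trim_trailing_dashes and its helper are hypothetical, and it assumes Biopython is installed. It relies on SeqIO.parse and SeqIO.write accepting file paths and on SeqIO.write accepting any iterable of records.

```python
from Bio import SeqIO


def trim_trailing_dashes(file_in, file_out):
    """Copy a fasta file, stripping trailing '-' from every sequence.

    Returns the number of records written. (Hypothetical helper name.)
    """
    def trimmed(records):
        for record in records:
            # Remove only the dashes at the end; internal gaps are kept
            record.seq = record.seq.rstrip("-")
            yield record

    # SeqIO.parse yields records lazily, so one record is processed at a time
    return SeqIO.write(trimmed(SeqIO.parse(file_in, "fasta")), file_out, "fasta")
```

It would be called with the same arguments as the function in the question, for example trim_trailing_dashes("dash_removal_test.fasta", "dashes_gone.fasta").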