Search code examples
pythonreplacesequencebiopythonfasta

Replace sequences between files using Biopython


I have two protein sequences FASTA files:

nsp.fasta --> original file

wsp.fasta --> output file from a signal peptide predictor tool, which returned the proteins in nsp.fasta with the signal stripped.

For example:

record in nsp.fasta:

>gi|564250271|ref|XP_006264203.1| PREDICTED: apolipoprotein D [Alligator mississippiensis]
MRGMLALLAALLGLLGLVEGQTFHMGQCPNPPVQEDFDPSKYLGKWYEIEKLPSGFEQER
CVQANYSLKANGKIKVLTKMVRSAQHLTCLQHRMMLLVSSPVMPASPYWVVATDYENYAL
VYSCTSFFWLFHVDYAWIRSRTPQLHPETVEHLKSVLRSYRIQTGMMLPTDQMNCPSDM

record in wsp.fasta:

>gi|564250271|ref|XP|006264203.1|  PREDICTED: apolipoprotein D [Alligator mississippiensis]; MatureChain: 21-179
QTFHMGQCPNPPVQEDFDPSKYLGKWYEIEKLPSGFEQERCVQANYSLKANGKIKVLTKM
VRSAQHLTCLQHRMMLLVSSPVMPASPYWVVATDYENYALVYSCTSFFWLFHVDYAWIRS
RTPQLHPETVEHLKSVLRSYRIQTGMMLPTDQMNCPSDM

However, not all the proteins in nsp.fasta contained a signal peptide, so wsp.fasta is a subset of the proteins in nsp.fasta that contains the signal. What I need is a unique file that contains all the protein records, both proteins with no signal peptide found and the mature chains with the signal peptide stripped.

I have tried the following:

from Bio import SeqIO

file1 = SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta")

file2 = SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta")

for seq1 in file1:
    for seq2 in file2:
        if seq2.id == seq1.id:
            seq1.seq = seq2.seq
            SeqIO.write(seq1, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")

But there's no output at all. I have tried putting the SeqIO.write out of the loops, and it returns a blank file. What am I doing wrong? There already exist any method to merge two files or to replace sequences in one file with sequences in other file?

Thank you in advance!!

Sergio

Edited code, I added an elif clause in an attempt to also add the records in nsp.fasta that doesn't match wsp.fasta, but it doesn't work:

to_write = []

for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.txt", "fasta"):
    for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.txt", "fasta"):
        if seq1.id == seq2.id:
            seq1.seq = seq2.seq
            to_write.append(seq1)
        elif seq1.id != seq2.id:
            to_write.append(seq1)

SeqIO.write(to_write, r"c:\Users\Sergio\Desktop\nuevsp.txt", "fasta")

Solution

  • As you have written it, every time you write a new sequence, you're overwriting the previous one. Try storing your records in a list and then writing out the list when the loop is completed.

    to_write = []
    for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta"):
        for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta"):
            if seq2.id == seq1.id:
                seq1.seq = seq2.seq
                to_write.append(seq1)
    SeqIO.write(to_write, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")
    

    Edit to suggest another approach using list comprehensions:

    ids_to_save = [x.id for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta")]
    records_to_save = [x for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta") if (x.id in ids_to_save)]
    SeqIO.write(records_to_save, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")
    

    Edit to address the "add the records in nsp.fasta that doesn't match wsp.fasta" need - general approach, not necessarily exact code:

    ids_not_wanted = [x.id for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta")]
    records_to_save_2 = [x for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta") if (x.id not in ids_not_wanted)]
    
    records_to_save.append(records_to_save_2)
    # If duplicate records are a problem, eliminate them using "set"
    records_to_save = list(set(records_to_save))
    SeqIO.write(records_to_save, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")