Change sequence names in a FASTA file using a dataframe


I have a problem, which I'll explain below.

I have a FASTA file such as:

>seqA
AAAAATTTGG
>seqB
ATTGGGCCG
>seqC
ATTGGCC
>seqD
ATTGGACAG

and a dataframe:

seq name      New name seq
seqB            BOBO
seqC            JOHN

I simply want to replace a sequence ID in the FASTA file with its new name whenever that ID appears in the "seq name" column of my dataframe. The result would be:

New FASTA file:

>seqA
AAAAATTTGG
>BOBO
ATTGGGCCG
>JOHN
ATTGGCC
>seqD
ATTGGACAG

Thank you very much

Edit: I used this script:

blast=pd.read_table("matches_Busco_0035_0042.m8",header=None)
blast.columns = ["qseqid", "Busco_ID", "pident", "length", "mismatch", "gapopen","qstart", "qend", "sstart", "send", "evalue", "bitscore"]

repl = blast[blast.pident > 95]

print(repl)

#substitution dataframe

newfile = []
count = 0

for rec in SeqIO.parse("concatenate_0035_0042_aa2.fa", "fasta"):
    #get corresponding value for record ID from dataframe
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
    #change record, if not empty
    if x.any():
        rec.name = rec.description = rec.id = x.iloc[0]
        count += 1
    #append record to list
    newfile.append(rec)

#write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
#tell us, how hard you had to work for us
print("I changed {} entries!".format(count))

And I got the following error:

Traceback (most recent call last):
  File "Get_busco_blast.py", line 74, in <module>
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
  File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'seq'

Solution

  • It's easier to do this with a library such as Biopython.

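    First, create a dictionary that maps each old name to its new name:

    import pandas as pd
    # index = old names, values = new names
    names = pd.Series(df['New name seq'].values, index=df['seq name']).to_dict()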

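    Now iterate over the records and rename those found in the dictionary:

    from Bio import SeqIO
    outs = []
    for record in SeqIO.parse("orig.fasta", "fasta"):
        if record.id in names:
            record.id = names[record.id]
            # clear the description so the old name is not written after the new ID
            record.description = ""
        outs.append(record)
    # SeqIO.write takes the records first, then the output file name or handle
    SeqIO.write(outs, "new.fasta", "fasta")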
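
Regarding the script in the question's edit: the AttributeError arises because the BLAST table's columns are qseqid, Busco_ID, pident and so on, and none of them is called seq, so `repl.seq` fails. Comparing against `repl.qseqid` is presumably what was intended. Separately, here is a minimal dependency-free sketch of the same renaming idea, using only the standard library. The function name `rename_fasta_headers` and the in-memory sample are hypothetical, and the mapping is assumed to be a plain dict of old name to new name.

```python
def rename_fasta_headers(lines, names):
    """Yield FASTA lines, replacing header IDs that appear in `names`.

    `names` maps old ID -> new ID; IDs not in the mapping pass through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'.
            parts = line[1:].split(None, 1)
            old_id = parts[0] if parts else ""
            new_id = names.get(old_id, old_id)
            # Keep any free-text description that followed the ID.
            rest = parts[1] if len(parts) > 1 else ""
            line = ">" + new_id + (" " + rest if rest else "")
        yield line


# Sample data taken from the question (held in memory instead of a file).
fasta = """>seqA
AAAAATTTGG
>seqB
ATTGGGCCG
>seqC
ATTGGCC
>seqD
ATTGGACAG
""".splitlines()

names = {"seqB": "BOBO", "seqC": "JOHN"}
renamed = list(rename_fasta_headers(fasta, names))
print("\n".join(renamed))
```

With a real file, the same generator could be fed an open file handle and its output written line by line. Unlike the Biopython version, this keeps whatever description follows the ID on the header line.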