python pandas biopython

Rename fasta file according to a dataframe in python


Hello, I have a huge FASTA file such as:

>Seq1.1
AAAGGAGAATAGA
>Seq2.2
AGGAGCTTCTCAC
>Seq3.1
CGTACTACGAGA
>Seq5.2
CGAGATATA
>Seq3.1
CGTACTACGAGA
>Seq2
AGGAGAT

and a dataframe such as:

tab

query  New_query
Seq1.1 Seq1.1
Seq2.2 Seq2.2
Seq3.1 Seq3.1_0
Seq5.2 Seq5.2_3
Seq3.1 Seq3.1_1

The idea is to rename each >Seqname according to tab.

For each Seqname, if tab['query'] != tab['New_query'], rename the Seqname to tab['New_query'].

PS: Not all of the >Seqname entries are present in tab; when one is missing, I leave it unchanged.

I should then get a new FASTA file such as:

>Seq1.1
AAAGGAGAATAGA
>Seq2.2
AGGAGCTTCTCAC
>Seq3.1_0
CGTACTACGAGA
>Seq5.2_3
CGAGATATA
>Seq3.1_1
CGTACTACGAGA
>Seq2
AGGAGAT

I tried this code:

from Bio import SeqIO

# tab is the pandas DataFrame shown above
records = SeqIO.parse("My_fasta_file.aa", "fasta")
for record in records:
    # Filtering the whole dataframe for every record is the slow step
    subtab = tab[tab['query'] == record.id]
    subtab = subtab.drop_duplicates(subset="New_query", keep="first")
    if subtab.empty:  # the sequence is not in tab, so it is not renamed
        continue
    new_id = subtab['New_query'].iloc[0]
    if subtab['query'].iloc[0] != new_id:
        # Assign the string itself, not the whole Series
        record.id = new_id
        record.description = new_id

It works, but it takes too much time.


Solution

  • You can create a mapper dictionary from the dataframe and then read the FASTA file line by line, replacing the lines that start with >:

    mapper = tab.set_index('query').to_dict()['New_query']
    
    with open('My_fasta_file.aa', 'r') as f_in, open('output.txt', 'w') as f_out:
        for line in map(str.strip, f_in):
            if line.startswith('>'):
                v = line.split('>')[-1]
                line = '>{}'.format(mapper.get(v, v))
            print(line, file=f_out)
    

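    This creates output.txt (both Seq3.1 records become Seq3.1_1, because the dictionary keeps only the last New_query for a repeated query):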

    >Seq1.1
    AAAGGAGAATAGA
    >Seq2.2
    AGGAGCTTCTCAC
    >Seq3.1_1
    CGTACTACGAGA
    >Seq5.2_3
    CGAGATATA
    >Seq3.1_1
    CGTACTACGAGA
    >Seq2
    AGGAGAT
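
If repeated headers should receive their new names in table order (Seq3.1_0 for the first Seq3.1, Seq3.1_1 for the second, as in the expected output of the question), one possible variation is to keep a queue of new names per query instead of a single value. The following is only a sketch under some assumptions: the helper names `build_queues` and `rename_fasta` are made up for illustration, each header line is assumed to contain nothing but the ID, and a header that occurs more often than it is listed in tab is assumed to keep its original name once its queue is empty.

```python
from collections import defaultdict, deque


def build_queues(pairs):
    """Map each original ID to a queue of its new names, in table order."""
    queues = defaultdict(deque)
    for old, new in pairs:
        queues[old].append(new)
    return queues


def rename_fasta(lines, pairs):
    """Yield FASTA lines, renaming headers in order of appearance."""
    queues = build_queues(pairs)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = line[1:]
            # .get() does not create a key, so unknown IDs stay untouched
            pending = queues.get(name)
            if pending:  # falsy when the ID is unknown or its queue is empty
                name = pending.popleft()
            line = ">" + name
        yield line


# Small in-memory example using rows from the question
pairs = [
    ("Seq1.1", "Seq1.1"),
    ("Seq2.2", "Seq2.2"),
    ("Seq3.1", "Seq3.1_0"),
    ("Seq5.2", "Seq5.2_3"),
    ("Seq3.1", "Seq3.1_1"),
]
fasta = [">Seq3.1", "CGTACTACGAGA", ">Seq3.1", "CGTACTACGAGA", ">Seq2", "AGGAGAT"]
result = list(rename_fasta(fasta, pairs))
print("\n".join(result))
```

With the dataframe from the question, the pairs could presumably be built with `zip(tab['query'], tab['New_query'])`, and an open file handle can be passed as `lines`, so the file is still processed one line at a time.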