How can I use Python or a Linux command to convert protein IDs into protein names by searching a local database?


I have two files: ID.txt containing protein IDs, like this:

KKP65897.1
KKP42119.1
KKP91065.1
OGY93232.1

The other file, nr.faa, is a FASTA-format database file downloaded from NCBI. It looks like this:

>KKP42119.1 hypothetical protein DDB_G027.......
MASTQNTVEEVAQJML.......
>KKP65897.1 hypothetical protein DDB_G127.......
MATSREEQNTVEEVAQJML.......

I want to search this FASTA database file for the IDs listed in ID.txt, return the protein names, like 'hypothetical protein', and store them in a txt file. That way I can link each ID with its protein name.

The database file is huge (~7 GB). I also extracted the header lines ('> .....') and saved them to a txt file (~3 GB); maybe it's faster to search in that file.

How can I do this in Python or on the Linux command line?

Thank you.


Solution

  • and return the protein names, like 'hypothetical protein', and store them in a txt file

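    With awk, which first loads the IDs into an array and then prints the second and third fields of every header line whose ID is in that array:

    awk 'NR==FNR{ a[$1];next }/^>/ && (substr($1,2) in a){ print $2,$3 }' ID.txt nr.faa > prot_names.txt
    

    The resulting prot_names.txt file will look like this (only the first two words of each description are printed):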

    hypothetical protein
    hypothetical protein
    ...
    

    If you want the whole header lines for those IDs instead, use grep with the IDs as fixed-string patterns:

    grep -Ff ID.txt nr.faa > prot_names.txt
    

    In this case prot_names.txt will contain:

    >KKP42119.1 hypothetical protein DDB_G027.......
    >KKP65897.1 hypothetical protein DDB_G127.......
    ...
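
Since the question also asks about Python, here is a minimal sketch of the same lookup that keeps each ID next to its full description. It assumes every header has the layout `>ID description` shown in the sample and that each ID in ID.txt sits on its own line; the function names and the tab-separated output format are illustrative choices, not part of the original answer. Headers that carry further IDs later on the same line are not handled.

```python
import sys


def load_ids(id_path):
    """Read one protein ID per line into a set for fast membership tests."""
    with open(id_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def map_ids_to_names(id_path, fasta_path, out_path):
    """Stream the FASTA (or headers-only) file once and write 'ID<TAB>name'.

    Returns the number of matching headers written.
    """
    wanted = load_ids(id_path)
    found = 0
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if not line.startswith(">"):
                continue  # sequence line, nothing to look up
            # Assumed header layout: ">ID description..."
            protein_id, _, name = line[1:].rstrip("\n").partition(" ")
            if protein_id in wanted:
                out.write(f"{protein_id}\t{name}\n")
                found += 1
    return found


if __name__ == "__main__":
    # Hypothetical invocation: python map_ids.py ID.txt nr.faa prot_names.tsv
    if len(sys.argv) == 4:
        count = map_ids_to_names(sys.argv[1], sys.argv[2], sys.argv[3])
        print(f"wrote {count} entries", file=sys.stderr)
    else:
        print("usage: map_ids.py IDS FASTA OUT", file=sys.stderr)
```

The file is read line by line, so memory use depends on the number of IDs rather than on the size of the database. Pointing it at the extracted headers file instead of nr.faa should work the same way, since non-header lines are skipped anyway.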