How to use Python or a Linux command to convert protein IDs into protein names by searching a local database?

I have two files. The first, ID.txt, contains protein IDs, one per line:

KKP65897.1
KKP42119.1
KKP91065.1
OGY93232.1

The other file, nr.faa, is a FASTA-format database downloaded from NCBI. It looks like this:

>KKP42119.1 hypothetical protein DDB_G027.......
MASTQNTVEEVAQJML.......
>KKP65897.1 hypothetical protein DDB_G127.......
MATSREEQNTVEEVAQJML.......

I want to look up each ID from ID.txt in this FASTA database, return the matching protein name (e.g. 'hypothetical protein'), and store the results in a text file, so that each ID is linked to its protein name.

The database file is huge (~7 GB). I also extracted the header lines ('> .....') and saved them to a text file (~3 GB); maybe it's faster to search in that file.

How can I do this in Python or on the Linux command line?

Thank you.

Solution

"... and return the protein names, like 'hypothetical protein', and store them in a txt file"

This can be done with awk:

awk 'NR==FNR{ a[$1];next }/^>/ && (substr($1,2) in a){ print $2,$3 }' ID.txt nr.faa > prot_names.txt

The resulting prot_names.txt file will look like this (print $2,$3 outputs only the first two words after the ID):

hypothetical protein
hypothetical protein
...

If you want the whole header lines containing the protein names, use grep instead (-F treats each ID as a fixed string, -f reads the IDs from a file):

grep -Ff ID.txt nr.faa > prot_names.txt

In this case, prot_names.txt will contain:

>KKP42119.1 hypothetical protein DDB_G027.......
>KKP65897.1 hypothetical protein DDB_G127.......
...
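
Since the question also asks about Python, here is a minimal sketch of the same lookup, not a tuned implementation. It assumes each header has the form '>ID description' as in the sample, and that ID.txt holds one ID per line. The function names load_ids, find_names and main are made up for illustration. It reads the FASTA file line by line, so the ~7 GB file is never loaded into memory, and it writes each matching ID, a tab, and the full description to the output file:

```python
import sys


def load_ids(id_path):
    """Read one protein ID per line into a set, skipping blank lines."""
    with open(id_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def find_names(fasta_path, wanted):
    """Yield (protein_id, description) for each header whose ID is in `wanted`."""
    with open(fasta_path) as handle:
        for line in handle:  # streams the file one line at a time
            if not line.startswith(">"):
                continue  # sequence line, nothing to look up
            # Assumed header layout: ">ID description ..."
            protein_id, _, description = line[1:].rstrip("\n").partition(" ")
            if protein_id in wanted:  # exact match on the ID, not a substring
                yield protein_id, description


def main(id_path, fasta_path, out_path):
    """Write 'ID<TAB>description' for every wanted ID found in the FASTA file."""
    wanted = load_ids(id_path)
    with open(out_path, "w") as out:
        for protein_id, description in find_names(fasta_path, wanted):
            out.write(f"{protein_id}\t{description}\n")


if __name__ == "__main__":
    # Expected arguments: ID file, FASTA (or headers-only) file, output file
    if len(sys.argv) == 4:
        main(*sys.argv[1:4])
```

It could be run as python lookup_names.py ID.txt nr.faa prot_names.txt (the script name is hypothetical). The extracted headers file should work as the second argument too, since sequence lines are skipped anyway. If a header packs several IDs into one line, only the first one is checked.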