I have two files. The first, ID.txt, contains protein IDs, like this:
KKP65897.1
KKP42119.1
KKP91065.1
OGY93232.1
The other file is nr.faa, a FASTA-format database file downloaded from NCBI. It looks like this:
>KKP42119.1 hypothetical protein DDB_G027.......
MASTQNTVEEVAQJML.......
>KKP65897.1 hypothetical protein DDB_G127.......
MATSREEQNTVEEVAQJML.......
I want to look up each ID from ID.txt in this FASTA database, get back the protein name (such as 'hypothetical protein'), and store the results in a txt file. That way I can link each ID to its protein name.
The database file is huge (~7G). I also extracted the header lines ('> .....') and saved them to a separate txt file (~3G); maybe it's faster to search in that file.
How can I do this in Python or on the Linux command line?
Thank you.
"and return the protein names, like 'hypothetical protein', and store them in a txt file"
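You can do this with awk. The command below loads the IDs from ID.txt into an array, then prints the second and third fields of every header line in nr.faa whose ID is in that array (enough for two-word names like 'hypothetical protein'):
awk 'NR==FNR{ a[$1];next }/^>/ && (substr($1,2) in a){ print $2,$3 }' ID.txt nr.faa > prot_names.txt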
The resulting prot_names.txt file will look like this:
hypothetical protein
hypothetical protein
...
If you want the whole header lines of the matching proteins instead, use grep (-F treats each ID as a fixed string, -f reads the IDs from a file):
grep -Ff ID.txt nr.faa > prot_names.txt
In this case the prot_names.txt file will contain:
>KKP42119.1 hypothetical protein DDB_G027.......
>KKP65897.1 hypothetical protein DDB_G127.......
...
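Since the question also asks about Python, here is a minimal sketch of the same lookup. It assumes the ID is the first space-delimited token of each header line, as in the sample above, and that the ID list fits in memory. The function names (load_ids, iter_matches, write_names) and the tab-separated output format are my own choices for illustration, not anything prescribed by NCBI or the question. It reads the FASTA file line by line, so the 7G file is never loaded whole, and it keeps the ID next to the full description so the two stay linked.

```python
def load_ids(path):
    """Read one protein ID per line into a set, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def iter_matches(lines, wanted):
    """Yield (protein_id, description) for header lines whose ID is in `wanted`.

    Assumes the ID is everything between '>' and the first space.
    """
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        protein_id, _, description = line[1:].rstrip("\n").partition(" ")
        if protein_id in wanted:
            yield protein_id, description


def write_names(id_path, fasta_path, out_path):
    """Stream the FASTA file once and write 'ID<TAB>description' per match."""
    wanted = load_ids(id_path)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for protein_id, description in iter_matches(fasta, wanted):
            out.write(f"{protein_id}\t{description}\n")
```

To use it, call write_names("ID.txt", "nr.faa", "prot_names.txt"). The same call should work on the extracted header file, since non-header lines are simply skipped. Looking IDs up in a set keeps each header check constant-time, so the run time is dominated by reading the file.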