I have a test.fasta file with the following data:
>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine
MLEAPKFTGIIGLNNNHDNYDLSQGFYHKLGEGSNMSIDSFGSLQLSNGG
GSVAMSVSSVGSNDSHTRILNHQGLKRVNGNYSVARSVNRGKVSHGLSDD
ALAQ
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
IGDRVRL
>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff
MDLGPTLTLQKGRQRRGKGPYAGVRSRGGRWVSEIRIPKTKTRIWLGSHH
SPEKAARAYDAALYCLKGEHGSFNFPNNRGPYLANRSVGSLPVDEIQCIA
AEFSCFDDSA
I would like to take the ID and the description and output them into a .tsv
file, with the first column being the ID and the second column holding the description.
Desired output:
| ID | Description |
| -------- | -------------- |
| 0124 | Protein description goes here 1 serine/threonine |
| 14552 | Protein description goes here 2 uncharacterized protein LOC11441 |
| 94014 | Protein description goes here 3 some more chemicals and stuff |
Any ideas on a one-line Bash command to achieve this?
I currently have this:
grep -a '^>' test.fasta |
awk '{print $1}'
which gives me the first field of each header line, including the ID, but I can't seem to figure out the rest!
Here's a simple sed script:
sed -n 's/^>[^0-9]*\([0-9][0-9]*\).*description=/\1\t/p' test.fasta
This simply looks for a line which begins with >, optionally followed by some non-digits and then a run of digits, with description= somewhere later on the line. It replaces everything up to and including description= with just the digits and a tab, and prints the resulting line.
(This assumes the first sequence of digits on the line is the ID. It also requires that your sed interprets \t as a literal tab, which isn't entirely portable.)
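If your sed does not understand \t, one possible workaround is to let the shell produce the tab and splice it into the script. The following is only a sketch: sample.fasta and sample.tsv are hypothetical file names, and the sample input is a cut-down copy of the question's data with shortened sequences.

```shell
# Hypothetical sample input: two headers from the question, sequences shortened.
cat > sample.fasta <<'EOF'
>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine
MLEAPKFTGIIGLNNNHDNYDLSQ
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVV
EOF

# Store a literal tab in a variable so the sed script does not rely on \t.
tab=$(printf '\t')

# Same substitution as before; double quotes let the shell expand ${tab},
# while \( \) and \1 are passed through to sed unchanged.
sed -n "s/^>[^0-9]*\([0-9][0-9]*\).*description=/\1${tab}/p" sample.fasta > sample.tsv
```

The double quotes are the only real change: the regular expression itself is the same, so the same assumption about the first run of digits being the ID still applies.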
The same could easily be recast into Awk, though it's arguably less elegant.
awk -F . 'BEGIN { OFS="\t" }
/^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' test.fasta
This assumes that the interesting part of the ID is always between the first and second dots, and it avoids the useless grep.
The script declares dot as the field separator with -F . and sets the output field separator OFS to tab. On lines which begin with >, it copies the original input line $0 into the variable d, strips everything up to and including description= from d, and then prints the second field and d.
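If the .tsv file should also carry the ID and Description column names from the desired output, the Awk version can print them once in the BEGIN block. This is a sketch under assumptions: it is not clear from the question whether a header row is actually wanted, and sample.fasta and output.tsv are hypothetical file names with shortened sequences.

```shell
# Hypothetical sample input: two headers from the question, sequences shortened.
cat > sample.fasta <<'EOF'
>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine
MLEAPKFTGIIGLNNNHDNYDLSQ
>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff
MDLGPTLTLQKGRQRRGKG
EOF

# Print the header row once (an assumption), then one ID/description row
# for every line that starts with ">".
awk -F . 'BEGIN { OFS="\t"; print "ID", "Description" }
  /^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' sample.fasta > output.tsv
```

Dropping the print in BEGIN gives the headerless output of the original command.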
I had to guess some requirements; if my guesses are wrong, please edit your question to clarify exactly how the numeric ID should be extracted, for example.