Output specific fields using bash

I have a test.fasta file with the following data:

>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine  
MLEAPKFTGIIGLNNNHDNYDLSQGFYHKLGEGSNMSIDSFGSLQLSNGG
GSVAMSVSSVGSNDSHTRILNHQGLKRVNGNYSVARSVNRGKVSHGLSDD
ALAQ
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
IGDRVRL
>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff
MDLGPTLTLQKGRQRRGKGPYAGVRSRGGRWVSEIRIPKTKTRIWLGSHH
SPEKAARAYDAALYCLKGEHGSFNFPNNRGPYLANRSVGSLPVDEIQCIA
AEFSCFDDSA

I would like to take the ID and the description and output them into a .tsv file, with the first column being the ID and the second column holding the description.

Desired output:

| ID    | Description                                                    |
| ----- | -------------------------------------------------------------- |
| 0124  | Protein description goes here 1 serine/threonine               |
| 14552 | Protein description goes here 2 uncharacterized protein LOC11441 |
| 94014 | Protein description goes here 3 some more chemicals and stuff  |

Any ideas on a one-line Bash command to achieve this?

I currently have this:

grep -a '^>' test.fasta |
awk '{print $1}'

which gives me the header lines and the IDs, but I can't seem to figure out the rest!

Solution

Here's a simple sed script:

sed -n 's/^>[^0-9]*\([0-9][0-9]*\).*description=/\1\t/p' test.fasta

This looks for a line which begins with >, optionally followed by some non-digits and then a run of digits, with description= somewhere later on the line. It replaces everything up to and including description= with just the digits and a tab, and prints the resulting line; the -n option suppresses output for every other line.

(This assumes the first sequence of digits on the line is the ID. It also requires that your sed interprets \t as a literal tab, which isn't entirely portable.)
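If your sed does not understand \t, one possible workaround is to let the shell supply the tab character. The following is only a sketch: the function name fasta_ids_to_tsv and the variable tab are made up for illustration, and it keeps the same assumption that the first run of digits on the header line is the ID.

```shell
# Hypothetical helper (name is an assumption, not part of the original answer).
# Reads the named files, or standard input when no file is given.
fasta_ids_to_tsv() {
    # Produce a real tab with printf instead of relying on sed's \t escape.
    tab=$(printf '\t')
    # Double quotes let the shell expand ${tab}; \( \) and \1 reach sed unchanged.
    sed -n "s/^>[^0-9]*\([0-9][0-9]*\).*description=/\1${tab}/p" "$@"
}

# Example call, assuming test.fasta is in the current directory:
#   fasta_ids_to_tsv test.fasta > test.tsv
```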

The same could easily be recast into Awk, though it's arguably less elegant.

awk -F . 'BEGIN { OFS="\t" }
    /^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' test.fasta

The Awk version assumes that the interesting part of the ID always sits between the first and second dots, and it avoids the useless grep.

This declares dot as the field separator with -F . and the output field separator OFS as tab, then extracts everything after description= from the original input line $0 into the variable d on lines which begin with >, then prints the second field and d.
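If the .tsv file should also start with a header row like the one in the desired output, the Awk script can print it from the BEGIN block. This is a sketch under that assumption; the function name fasta_to_tsv_awk is hypothetical, and the column titles are copied from the table in the question.

```shell
# Hypothetical wrapper (name is an assumption for illustration).
# Reads the named files, or standard input when no file is given.
fasta_to_tsv_awk() {
    awk -F . 'BEGIN { OFS="\t"; print "ID", "Description" }  # header row (assumed wanted)
        # On header lines, strip everything through description= and
        # print the text between the first and second dots plus the rest.
        /^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' "$@"
}

# Example call, assuming test.fasta is in the current directory:
#   fasta_to_tsv_awk test.fasta > test.tsv
```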

I had to guess some requirements; if my guesses are wrong, please edit your question to clarify exactly how the numeric ID should be extracted, for example.