
Output specific fields using bash


I have a test.fasta file with the following data:

>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine  
MLEAPKFTGIIGLNNNHDNYDLSQGFYHKLGEGSNMSIDSFGSLQLSNGG
GSVAMSVSSVGSNDSHTRILNHQGLKRVNGNYSVARSVNRGKVSHGLSDD
ALAQ
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
IGDRVRL
>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff
MDLGPTLTLQKGRQRRGKGPYAGVRSRGGRWVSEIRIPKTKTRIWLGSHH
SPEKAARAYDAALYCLKGEHGSFNFPNNRGPYLANRSVGSLPVDEIQCIA
AEFSCFDDSA

I would like to take the ID and the description and output them into a .tsv file, with the first column being the ID and the second column holding the description.

Desired output:

| ID | Description |
| -------- | -------------- |
| 0124    | Protein description goes here 1 serine/threonine           |
| 14552   | Protein description goes here 2 uncharacterized protein LOC11441            |
| 94014 | Protein description goes here 3 some more chemicals and stuff |

Any ideas on a one-line Bash command to achieve this?

I currently have this:

grep -a '^>' test.fasta |
awk '{print $1}'

which gives me the first lines and the IDs, but I can't seem to figure out the rest!


Solution

  • Here's a simple sed script:

    sed -n 's/^>[^0-9]*\([0-9][0-9]*\).*description=/\1\t/p' test.fasta
    

    With -n, sed prints nothing unless told to. The substitution matches a line which begins with >, optionally followed by some non-digits, then a run of digits (captured as \1), then everything up to and including description= later on the line. It replaces that whole match with just the captured digits and a tab, and the p flag prints the resulting line.

    (This assumes the first sequence of digits on the line is the ID. It also requires that your sed interprets \t as a literal tab, which isn't entirely portable.)
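
    If your sed does not understand \t, one possible workaround is to let the shell produce the tab and splice it into the script. This is only a sketch: the function name fasta_ids_sed is made up for illustration, and it keeps the same assumption that the first run of digits on the header line is the ID.

    ```shell
    # Hypothetical helper (name is an assumption): print "ID<TAB>description"
    # for every header line of the FASTA file given as $1.
    fasta_ids_sed() {
        # printf understands \t everywhere, so use it to get a literal tab
        tab=$(printf '\t')
        # Double quotes let ${tab} expand; \( \) and \1 reach sed unchanged
        sed -n "s/^>[^0-9]*\([0-9][0-9]*\).*description=/\1${tab}/p" "$1"
    }
    ```

    It would be called as fasta_ids_sed test.fasta > test.tsv to write the result to a file.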

    The same could easily be recast into Awk, though it's arguably less elegant.

    awk -F . 'BEGIN { OFS="\t" }
        /^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' test.fasta
    

    This assumes the interesting part of the ID always sits between the first and second dots, and it avoids the useless grep.

    This declares dot as the field separator with -F . and the output field separator OFS as tab, then extracts everything after description= from the original input line $0 into the variable d on lines which begin with >, then prints the second field and d.
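
    Since the goal is a .tsv file, the Awk version could be wrapped up roughly as follows. Treat this as a sketch under assumptions: the function name fasta_to_tsv is invented here, the header row is only a guess based on the desired output in the question, and the extra sub() strips trailing blanks from the description (the first header line in the sample appears to end in spaces).

    ```shell
    # Hypothetical helper (name is an assumption): convert the header lines
    # of the FASTA file given as $1 into tab-separated "ID, Description" rows.
    fasta_to_tsv() {
        awk -F . 'BEGIN { OFS="\t"; print "ID", "Description" }  # header row (assumed wanted)
            /^>/ {
                d = $0
                sub(/.*description=/, "", d)   # keep only the text after description=
                sub(/[ \t]+$/, "", d)          # drop trailing spaces or tabs
                print $2, d                    # $2 is the text between the first two dots
            }' "$1"
    }
    ```

    For example, fasta_to_tsv test.fasta > test.tsv would write the table to test.tsv.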

    I had to guess some requirements; if my guesses are wrong, please edit your question to clarify exactly how the numeric ID should be extracted, for example.